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Preface 


New for the Second Edition 

The first edition of this book was published in 2012, during a time when open source 
data analysis libraries for Python (such as pandas) were very new and developing rap¬ 
idly. In this updated and expanded second edition, I have overhauled the chapters to 
account both for incompatible changes and deprecations as well as new features that 
have occurred in the last five years. I’ve also added fresh content to introduce tools 
that either did not exist in 2012 or had not matured enough to make the first cut. 
Finally, I have tried to avoid writing about new or cutting-edge open source projects 
that may not have had a chance to mature. I would like readers of this edition to find 
that the content is still almost as relevant in 2020 or 2021 as it is in 2017. 

The major updates in this second edition include: 

• All code, including the Python tutorial, updated for Python 3.6 (the first edition 
used Python 2.7) 

• Updated Python installation instructions for the Anaconda Python Distribution 
and other needed Python packages 

• Updates for the latest versions of the pandas library in 2017 

• A new chapter on some more advanced pandas tools, and some other usage tips 

• A brief introduction to using statsmodels and scikit-learn 

I also reorganized a significant portion of the content from the first edition to make 
the book more accessible to newcomers. 
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Conventions Used in This Book 

The following typographical conventions are used in this book: 

Italic 

Indicates new terms, URLs, email addresses, filenames, and file extensions. 
Constant width 

Used for program listings, as well as within paragraphs to refer to program ele¬ 
ments such as variable or function names, databases, data types, environment 
variables, statements, and keywords. 

Constant width bold 

Shows commands or other text that should be typed literally by the user. 

Constant width italic 

Shows text that should be replaced with user-supplied values or by values deter¬ 
mined by context. 



This element signifies a tip or suggestion. 



This element signifies a general note. 



This element indicates a warning or caution. 


Using Code Examples 

You can find data files and related material for each chapter is available in this book’s 
GitHub repository at http://github.com/wesm/pydata-book. 

This book is here to help you get your job done. In general, if example code is offered 
with this book, you may use it in your programs and documentation. You do not 
need to contact us for permission unless you’re reproducing a significant portion of 
the code. For example, writing a program that uses several chunks of code from this 
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book does not require permission. Selling or distributing a CD-ROM of examples 
from O’Reilly books does require permission. Answering a question by citing this 
book and quoting example code does not require permission. Incorporating a signifi¬ 
cant amount of example code from this book into your product’s documentation does 
require permission. 

We appreciate, but do not require, attribution. An attribution usually includes the 
title, author, publisher, and ISBN. For example: “Python for Data Analysis by Wes 
McKinney (O’Reilly). Copyright 2017 Wes McKinney, 978-1-491-95766-0.” 

If you feel your use of code examples falls outside fair use or the permission given 
above, feel free to contact us at permissions@oreilly.com. 


O'Reilly Safari 


^Safari 


Safari (formerly Safari Books Online) is a membership-based 
training and reference platform for enterprise, government, 
educators, and individuals. 


Members have access to thousands of books, training videos, Learning Paths, interac¬ 
tive tutorials, and curated playlists from over 250 publishers, including O’Reilly 
Media, Harvard Business Review, Prentice Hall Professional, Addison-Wesley Profes¬ 
sional, Microsoft Press, Sams, Que, Peachpit Press, Adobe, Focal Press, Cisco Press, 
John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe 
Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, and 
Course Technology, among others. 

For more information, please visit http://oreilly.com/safari. 


How to Contact Us 


Please address comments and questions concerning this book to the publisher: 

O’Reilly Media, Inc. 

1005 Gravenstein Highway North 
Sebastopol, CA 95472 

800-998-9938 (in the United States or Canada) 

707-829-0515 (international or local) 

707-829-0104 (fax) 

We have a web page for this book, where we list errata, examples, and any additional 
information. You can access this page at http://bit.ly/python_data_analysis_2e. 
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To comment or ask technical questions about this book, send email to bookques- 
tions@oreilly.com. 

For more information about our books, courses, conferences, and news, see our web¬ 
site at http://www.oreilly.com. 

Find us on Facebook: http://facebook.com/oreilly 

Follow us on Twitter: http://twitter.com/oreillymedia 

Watch us on YouTube: http://www.youtube.com/oreillymedia 
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CHAPTER 1 


Preliminaries 


1.1 What Is This Book About? 

This book is concerned with the nuts and bolts of manipulating, processing, cleaning, 
and crunching data in Python. My goal is to offer a guide to the parts of the Python 
programming language and its data-oriented library ecosystem and tools that will 
equip you to become an effective data analyst. While “data analysis” is in the title of 
the book, the focus is specifically on Python programming, libraries, and tools as 
opposed to data analysis methodology. This is the Python programming you need/or 
data analysis. 

What Kinds of Data? 

When I say “data,” what am I referring to exactly? The primary focus is on structured 
data, a deliberately vague term that encompasses many different common forms of 
data, such as: 

• Tabular or spreadsheet-like data in which each column may be a different type 
(string, numeric, date, or otherwise). This includes most kinds of data commonly 
stored in relational databases or tab- or comma-delimited text files. 

• Multidimensional arrays (matrices). 

• Multiple tables of data interrelated by key columns (what would be primary or 
foreign keys for a SQL user). 

• Evenly or unevenly spaced time series. 

This is by no means a complete list. Even though it may not always be obvious, a large 
percentage of datasets can be transformed into a structured form that is more suitable 
for analysis and modeling. If not, it may be possible to extract features from a dataset 
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into a structured form. As an example, a collection of news articles could be pro¬ 
cessed into a word frequency table, which could then be used to perform sentiment 
analysis. 

Most users of spreadsheet programs like Microsoft Excel, perhaps the most widely 
used data analysis tool in the world, will not be strangers to these kinds of data. 

1.2 Why Python for Data Analysis? 

For many people, the Python programming language has strong appeal. Since its first 
appearance in 1991, Python has become one of the most popular interpreted pro¬ 
gramming languages, along with Perl, Ruby, and others. Python and Ruby have 
become especially popular since 2005 or so for building websites using their numer¬ 
ous web frameworks, like Rails (Ruby) and Django (Python). Such languages are 
often called scripting languages, as they can be used to quickly write small programs, 
or scripts to automate other tasks. I don’t like the term “scripting language,” as it car¬ 
ries a connotation that they cannot be used for building serious software. Among 
interpreted languages, for various historical and cultural reasons, Python has devel¬ 
oped a large and active scientific computing and data analysis community. In the last 
10 years, Python has gone from a bleeding-edge or “at your own risk” scientific com¬ 
puting language to one of the most important languages for data science, machine 
learning, and general software development in academia and industry. 

For data analysis and interactive computing and data visualization, Python will inevi¬ 
tably draw comparisons with other open source and commercial programming lan¬ 
guages and tools in wide use, such as R, MATLAB, SAS, Stata, and others. In recent 
years, Python’s improved support for libraries (such as pandas and scikit-learn) has 
made it a popular choice for data analysis tasks. Combined with Python’s overall 
strength for general-purpose software engineering, it is an excellent option as a pri¬ 
mary language for building data applications. 

Python as Glue 

Part of Python’s success in scientific computing is the ease of integrating C, C++, and 
FORTRAN code. Most modern computing environments share a similar set of legacy 
FORTRAN and C libraries for doing linear algebra, optimization, integration, fast 
Fourier transforms, and other such algorithms. The same story has held true for 
many companies and national labs that have used Python to glue together decades’ 
worth of legacy software. 

Many programs consist of small portions of code where most of the time is spent, 
with large amounts of “glue code” that doesn’t run often. In many cases, the execution 
time of the glue code is insignificant; effort is most fruitfully invested in optimizing 
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the computational bottlenecks, sometimes by moving the code to a lower-level lan¬ 
guage like C. 

Solving the "Two-Language" Problem 

In many organizations, it is common to research, prototype, and test new ideas using 
a more specialized computing language like SAS or R and then later port those ideas 
to be part of a larger production system written in, say, Java, C#, or C++. What people 
are increasingly finding is that Python is a suitable language not only for doing 
research and prototyping but also for building the production systems. Why main¬ 
tain two development environments when one will suffice? I believe that more and 
more companies will go down this path, as there are often significant organizational 
benefits to having both researchers and software engineers using the same set of pro¬ 
gramming tools. 

Why Not Python? 

While Python is an excellent environment for building many kinds of analytical 
applications and general-purpose systems, there are a number of uses for which 
Python may be less suitable. 

As Python is an interpreted programming language, in general most Python code will 
run substantially slower than code written in a compiled language like Java or C++. 
As programmer time is often more valuable than CPU time , many are happy to make 
this trade-off. However, in an application with very low latency or demanding 
resource utilization requirements (e.g., a high-frequency trading system), the time 
spent programming in a lower-level (but also lower-productivity) language like C++ 
to achieve the maximum possible performance might be time well spent. 

Python can be a challenging language for building highly concurrent, multithreaded 
applications, particularly applications with many CPU-bound threads. The reason for 
this is that it has what is known as the global interpreter lock (GIL), a mechanism that 
prevents the interpreter from executing more than one Python instruction at a time. 
The technical reasons for why the GIL exists are beyond the scope of this book. While 
it is true that in many big data processing applications, a cluster of computers may be 
required to process a dataset in a reasonable amount of time, there are still situations 
where a single-process, multithreaded system is desirable. 

This is not to say that Python cannot execute truly multithreaded, parallel code. 
Python C extensions that use native multithreading (in C or C++) can run code in 
parallel without being impacted by the GIL, so long as they do not need to regularly 
interact with Python objects. 
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1.3 Essential Python Libraries 

For those who are less familiar with the Python data ecosystem and the libraries used 
throughout the book, I will give a brief overview of some of them. 

NumPy 

NumPy, short for Numerical Python, has long been a cornerstone of numerical com¬ 
puting in Python. It provides the data structures, algorithms, and library glue needed 
for most scientific applications involving numerical data in Python. NumPy contains, 
among other things: 

• A fast and efficient multidimensional array object ndarray 

• Functions for performing element-wise computations with arrays or mathemati¬ 
cal operations between arrays 

• Tools for reading and writing array-based datasets to disk 

• Linear algebra operations, Fourier transform, and random number generation 

• A mature C API to enable Python extensions and native C or C++ code to access 
NumPy’s data structures and computational facilities 

Beyond the fast array-processing capabilities that NumPy adds to Python, one of its 
primary uses in data analysis is as a container for data to be passed between algo¬ 
rithms and libraries. For numerical data, NumPy arrays are more efficient for storing 
and manipulating data than the other built-in Python data structures. Also, libraries 
written in a lower-level language, such as C or Fortran, can operate on the data stored 
in a NumPy array without copying data into some other memory representation. 
Thus, many numerical computing tools for Python either assume NumPy arrays as a 
primary data structure or else target seamless interoperability with NumPy. 

pandas 

pandas provides high-level data structures and functions designed to make working 
with structured or tabular data fast, easy, and expressive. Since its emergence in 2010, 
it has helped enable Python to be a powerful and productive data analysis environ¬ 
ment. The primary objects in pandas that will be used in this book are the DataFrame, 
a tabular, column-oriented data structure with both row and column labels, and the 
Series, a one-dimensional labeled array object. 

pandas blends the high-performance, array-computing ideas of NumPy with the flex¬ 
ible data manipulation capabilities of spreadsheets and relational databases (such as 
SQL). It provides sophisticated indexing functionality to make it easy to reshape, slice 
and dice, perform aggregations, and select subsets of data. Since data manipulation, 
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preparation, and cleaning is such an important skill in data analysis, pandas is one of 
the primary focuses of this book. 

As a bit of background, I started building pandas in early 2008 during my tenure at 
AQR Capital Management, a quantitative investment management firm. At the time, 
I had a distinct set of requirements that were not well addressed by any single tool at 
my disposal: 

• Data structures with labeled axes supporting automatic or explicit data alignment 
—this prevents common errors resulting from misaligned data and working with 
differently indexed data coming from different sources 

• Integrated time series functionality 

• The same data structures handle both time series data and non-time series data 

• Arithmetic operations and reductions that preserve metadata 

• Flexible handling of missing data 

• Merge and other relational operations found in popular databases (SQL-based, 
for example) 

I wanted to be able to do all of these things in one place, preferably in a language well 
suited to general-purpose software development. Python was a good candidate lan¬ 
guage for this, but at that time there was not an integrated set of data structures and 
tools providing this functionality. As a result of having been built initially to solve 
finance and business analytics problems, pandas features especially deep time series 
functionality and tools well suited for working with time-indexed data generated by 
business processes. 

For users of the R language for statistical computing, the DataFrame name will be 
familiar, as the object was named after the similar R data.frame object. Unlike 
Python, data frames are built into the R programming language and its standard 
library. As a result, many features found in pandas are typically either part of the R 
core implementation or provided by add-on packages. 

The pandas name itself is derived from panel data, an econometrics term for multidi¬ 
mensional structured datasets, and a play on the phrase Python data analysis itself. 

matplotlib 

matplotlib is the most popular Python library for producing plots and other two- 
dimensional data visualizations. It was originally created by John D. Hunter and is 
now maintained by a large team of developers. It is designed for creating plots suit¬ 
able for publication. While there are other visualization libraries available to Python 
programmers, matplotlib is the most widely used and as such has generally good inte- 
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gration with the rest of the ecosystem. I think it is a safe choice as a default visualiza¬ 
tion tool. 

IPython and Jupyter 

The IPython project began in 2001 as Fernando Perezs side project to make a better 
interactive Python interpreter. In the subsequent 16 years it has become one of the 
most important tools in the modern Python data stack. While it does not provide any 
computational or data analytical tools by itself, IPython is designed from the ground 
up to maximize your productivity in both interactive computing and software devel¬ 
opment. It encourages an execute-explore workflow instead of the typical edit-compile- 
run workflow of many other programming languages. It also provides easy access to 
your operating system’s shell and filesystem. Since much of data analysis coding 
involves exploration, trial and error, and iteration, IPython can help you get the job 
done faster. 

In 2014, Fernando and the IPython team announced the Jupyter project, a broader 
initiative to design language-agnostic interactive computing tools. The IPython web 
notebook became the Jupyter notebook, with support now for over 40 programming 
languages. The IPython system can now be used as a kernel (a programming language 
mode) for using Python with Jupyter. 

IPython itself has become a component of the much broader Jupyter open source 
project, which provides a productive environment for interactive and exploratory 
computing. Its oldest and simplest “mode” is as an enhanced Python shell designed to 
accelerate the writing, testing, and debugging of Python code. You can also use the 
IPython system through the Jupyter Notebook, an interactive web-based code “note¬ 
book” offering support for dozens of programming languages. The IPython shell and 
Jupyter notebooks are especially useful for data exploration and visualization. 

The Jupyter notebook system also allows you to author content in Markdown and 
HTML, providing you a means to create rich documents with code and text. Other 
programming languages have also implemented kernels for Jupyter to enable you to 
use languages other than Python in Jupyter. 

For me personally, IPython is usually involved with the majority of my Python work, 
including running, debugging, and testing code. 

In the accompanying book materials, you will find Jupyter notebooks containing all 
the code examples from each chapter. 

SciPy 

SciPy is a collection of packages addressing a number of different standard problem 
domains in scientific computing. Here is a sampling of the packages included: 
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scipy.integrate 

Numerical integration routines and differential equation solvers 
scipy.linalg 

Linear algebra routines and matrix decompositions extending beyond those pro¬ 
vided in numpy.linalg 

scipy.optimize 

Function optimizers (minimizers) and root finding algorithms 

scipy.signal 

Signal processing tools 

scipy.sparse 

Sparse matrices and sparse linear system solvers 
scipy.special 

Wrapper around SPECFUN, a Fortran library implementing many common 
mathematical functions, such as the gamma function 

scipy.stats 

Standard continuous and discrete probability distributions (density functions, 
samplers, continuous distribution functions), various statistical tests, and more 
descriptive statistics 

Together NumPy and SciPy form a reasonably complete and mature computational 
foundation for many traditional scientific computing applications. 

scikit-learn 

Since the project’s inception in 2010, scikit-learn has become the premier general- 
purpose machine learning toolkit for Python programmers. In just seven years, it has 
had over 1,500 contributors from around the world. It includes submodules for such 
models as: 

• Classification: SVM, nearest neighbors, random forest, logistic regression, etc. 

• Regression: Lasso, ridge regression, etc. 

• Clustering: fc-means, spectral clustering, etc. 

• Dimensionality reduction: PCA, feature selection, matrix factorization, etc. 

• Model selection: Grid search, cross-validation, metrics 

• Preprocessing: Feature extraction, normalization 

Along with pandas, statsmodels, and IPython, scikit-learn has been critical for ena¬ 
bling Python to be a productive data science programming language. While I won’t 
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be able to include a comprehensive guide to scikit-learn in this book, I will give a 
brief introduction to some of its models and how to use them with the other tools 
presented in the book. 

statsmodels 

statsmodels is a statistical analysis package that was seeded by work from Stanford 
University statistics professor Jonathan Taylor, who implemented a number of regres¬ 
sion analysis models popular in the R programming language. Skipper Seabold and 
Josef Perktold formally created the new statsmodels project in 2010 and since then 
have grown the project to a critical mass of engaged users and contributors. Nathaniel 
Smith developed the Patsy project, which provides a formula or model specification 
framework for statsmodels inspired by R’s formula system. 

Compared with scikit-learn, statsmodels contains algorithms for classical (primarily 
frequentist) statistics and econometrics. This includes such submodules as: 

• Regression models: Linear regression, generalized linear models, robust linear 
models, linear mixed effects models, etc. 

• Analysis of variance (ANOVA) 

• Time series analysis: AR, ARMA, ARIMA, VAR, and other models 

• Nonparametric methods: Kernel density estimation, kernel regression 

• Visualization of statistical model results 

statsmodels is more focused on statistical inference, providing uncertainty estimates 
and p-values for parameters, scikit-learn, by contrast, is more prediction-focused. 

As with scikit-learn, I will give a brief introduction to statsmodels and how to use it 
with NumPy and pandas. 

1.4 Installation and Setup 

Since everyone uses Python for different applications, there is no single solution for 
setting up Python and required add-on packages. Many readers will not have a com¬ 
plete Python development environment suitable for following along with this book, 
so here I will give detailed instructions to get set up on each operating system. I rec¬ 
ommend using the free Anaconda distribution. At the time of this writing, Anaconda 
is offered in both Python 2.7 and 3.6 forms, though this might change at some point 
in the future. This book uses Python 3.6, and I encourage you to use Python 3.6 or 
higher. 
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Windows 

To get started on Windows, download the Anaconda installer. I recommend follow¬ 
ing the installation instructions for Windows available on the Anaconda download 
page, which may have changed between the time this book was published and when 
you are reading this. 

Now, let’s verify that things are configured correctly. To open the Command Prompt 
application (also known as cmd.exe), right-click the Start menu and select Command 
Prompt. Try starting the Python interpreter by typing python. You should see a mes¬ 
sage that matches the version of Anaconda you installed: 

C:\Users\wesm>python 

Python 3.5.2 |Anaconda 4.1.1 (64-btt)| (default. Jut 5 2016, 11:41:13) 

[MSC v. 1900 64 bit (AMD64)] on Win32 
»> 

To exit the shell, press Ctrl-D (on Linux or macOS), Ctrl-Z (on Windows), or type 
the command exit() and press Enter. 

Apple (OS X, macOS) 

Download the OS X Anaconda installer, which should be named something like 
Anaconda3-4.LO-MacOSX-x86_64.pkg. Double-click the .pkg file to run the installer. 
When the installer runs, it automatically appends the Anaconda executable path to 
yom .bash_profile file. This is located at /Users/$USER/.bash_profile. 

To verify everything is working, try launching IPython in the system shell (open the 
Terminal application to get a command prompt): 

$ ipython 

To exit the shell, press Ctrl-D or type exit() and press Enter. 

GNU/Linux 

Linux details will vary a bit depending on your Linux flavor, but here I give details for 
such distributions as Debian, Ubuntu, CentOS, and Fedora. Setup is similar to OS X 
with the exception of how Anaconda is installed. The installer is a shell script that 
must be executed in the terminal. Depending on whether you have a 32-bit or 64-bit 
system, you will either need to install the x86 (32-bit) or x86_64 (64-bit) installer. You 
will then have a file named something similar to Anaconda3-4.1.0-Linux-x86_64.sh. 
To install it, execute this script with bash: 

$ bash Anaconda3-4.1.0-Ltnux-x86_64.sh 
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Some Linux distributions have versions of all the required Python 
packages in their package managers and can be installed using a 
tool like apt. The setup described here uses Anaconda, as it’s both 
easily reproducible across distributions and simpler to upgrade 
packages to their latest versions. 


After accepting the license, you will be presented with a choice of where to put the 
Anaconda files. I recommend installing the files in the default location in your home 
directory—for example, /home/$USER/anaconda (with your username, naturally). 

The Anaconda installer may ask if you wish to prepend its bin/ directory to your 
$PATH variable. If you have any problems after installation, you can do this yourself by 
modifying your .bashrc (or .zshrc, if you are using the zsh shell) with something akin 
to: 


export PATH=/home/$USER/anaconda/bin:$PATH 

After doing this you can either start a new terminal process or execute your .bashrc 
again with source -/.bashrc. 

Installing or Updating Python Packages 

At some point while reading, you may wish to install additional Python packages that 
are not included in the Anaconda distribution. In general, these can be installed with 
the following command: 

conda install package_nane 

If this does not work, you may also be able to install the package using the pip pack¬ 
age management tool: 

pip Install package_ncme 

You can update packages by using the conda update command: 
conda update package_nane 

pip also supports upgrades using the - - upgrade flag: 
pip install --upgrade package_ntme 

You will have several opportunities to try out these commands throughout the book. 



While you can use both conda and pip to install packages, you 
should not attempt to update conda packages with pip, as doing so 
can lead to environment problems. When using Anaconda or Min- 
iconda, it’s best to first try updating with conda. 
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Python 2 and Python 3 

The first version of the Python 3.x line of interpreters was released at the end of 2008. 
It included a number of changes that made some previously written Python 2.x code 
incompatible. Because 17 years had passed since the very first release of Python in 
1991, creating a “breaking” release of Python 3 was viewed to be for the greater good 
given the lessons learned during that time. 

In 2012, much of the scientific and data analysis community was still using Python 
2.x because many packages had not been made fully Python 3 compatible. Thus, the 
first edition of this book used Python 2.7. Now, users are free to choose between 
Python 2.x and 3.x and in general have full library support with either flavor. 

However, Python 2.x will reach its development end of life in 2020 (including critical 
security patches), and so it is no longer a good idea to start new projects in Python 
2.7. Therefore, this book uses Python 3.6, a widely deployed, well-supported stable 
release. We have begun to call Python 2.x “Legacy Python” and Python 3.x simply 
“Python.” I encourage you to do the same. 

This book uses Python 3.6 as its basis. Your version of Python may be newer than 3.6, 
but the code examples should be forward compatible. Some code examples may work 
differently or not at all in Python 2.7. 

Integrated Development Environments (IDEs) and Text Editors 

When asked about my standard development environment, I almost always say “IPy- 
thon plus a text editor.” I typically write a program and iteratively test and debug each 
piece of it in IPython or Jupyter notebooks. It is also useful to be able to play around 
with data interactively and visually verify that a particular set of data manipulations is 
doing the right thing. Libraries like pandas and NumPy are designed to be easy to use 
in the shell. 

When building software, however, some users may prefer to use a more richly fea¬ 
tured IDE rather than a comparatively primitive text editor like Emacs or Vim. Here 
are some that you can explore: 

• PyDev (free), an IDE built on the Eclipse platform 

• PyCharm from JetBrains (subscription-based for commercial users, free for open 
source developers) 

• Python Tools for Visual Studio (for Windows users) 

• Spyder (free), an IDE currently shipped with Anaconda 

• Komodo IDE (commercial) 
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Due to the popularity of Python, most text editors, like Atom and Sublime Text 2, 
have excellent Python support. 

1.5 Community and Conferences 

Outside of an internet search, the various scientific and data-related Python mailing 
lists are generally helpful and responsive to questions. Some to take a look at include: 

• pydata: A Google Group list for questions related to Python for data analysis and 
pandas 

• pystatsmodels: For statsmodels or pandas-related questions 

• Mailing list for scikit-learn ( scikit-learn@python.org ) and machine learning in 
Python, generally 

• numpy-discussion: For NumPy-related questions 

• scipy-user: For general SciPy or scientific Python questions 

I deliberately did not post URLs for these in case they change. They can be easily 
located via an internet search. 

Each year many conferences are held all over the world for Python programmers. If 
you would like to connect with other Python programmers who share your interests, 
I encourage you to explore attending one, if possible. Many conferences have finan¬ 
cial support available for those who cannot afford admission or travel to the confer¬ 
ence. Here are some to consider: 

• PyCon and EuroPython: The two main general Python conferences in North 
America and Europe, respectively 

• SciPy and EuroSciPy: Scientific-computing-oriented conferences in North Amer¬ 
ica and Europe, respectively 

• PyData: A worldwide series of regional conferences targeted at data science and 
data analysis use cases 

• International and regional PyCon conferences (see http://pycon.org for a com¬ 
plete listing) 

1.6 Navigating This Book 

If you have never programmed in Python before, you will want to spend some time in 
Chapters 2 and 3, where I have placed a condensed tutorial on Python language fea¬ 
tures and the IPython shell and Jupyter notebooks. These things are prerequisite 
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knowledge for the remainder of the book. If you have Python experience already, you 
may instead choose to skim or skip these chapters. 

Next, I give a short introduction to the key features of NumPy, leaving more 
advanced NumPy use for Appendix A. Then, I introduce pandas and devote the rest 
of the book to data analysis topics applying pandas, NumPy, and matplotlib (for visu¬ 
alization). I have structured the material in the most incremental way possible, 
though there is occasionally some minor cross-over between chapters, with a few iso¬ 
lated cases where concepts are used that haven’t necessarily been introduced yet. 

While readers may have many different end goals for their work, the tasks required 
generally fall into a number of different broad groups: 

Interacting with the outside world 

Reading and writing with a variety of file formats and data stores 

Preparation 

Cleaning, munging, combining, normalizing, reshaping, slicing and dicing, and 
transforming data for analysis 

Transformation 

Applying mathematical and statistical operations to groups of datasets to derive 
new datasets (e.g., aggregating a large table by group variables) 

Modeling and computation 

Connecting your data to statistical models, machine learning algorithms, or other 
computational tools 

Presentation 

Creating interactive or static graphical visualizations or textual summaries 

Code Examples 

Most of the code examples in the book are shown with input and output as it would 
appear executed in the IPython shell or in Jupyter notebooks: 

In [5]: CODE EXAMPLE 
0ut[5]: OUTPUT 

When you see a code example like this, the intent is for you to type in the example 
code in the In block in your coding environment and execute it by pressing the Enter 
key (or Shift-Enter in Jupyter). You should see output similar to what is shown in the 
Out block. 

Data for Examples 

Datasets for the examples in each chapter are hosted in a GitHub repository. You can 
download this data either by using the Git version control system on the command 
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line or by downloading a zip file of the repository from the website. If you run into 
problems, navigate to my website for up-to-date instructions about obtaining the 
book materials. 

I have made every effort to ensure that it contains everything necessary to reproduce 
the examples, but I may have made some mistakes or omissions. If so, please send me 
an email: book@wesmckinney.com. The best way to report errors in the book is on the 
errata page on the O’Reilly website. 

Import Conventions 

The Python community has adopted a number of naming conventions for commonly 
used modules: 

import numpy as np 

import matpiotlib.pypiot as pit 

import pandas as pd 

import seaborn as sns 

import statsmodels as sm 

This means that when you see np. a range, this is a reference to the a range function in 
NumPy. This is done because it’s considered bad practice in Python software develop¬ 
ment to import everything (from numpy Import *) from a large package like NumPy. 

Jargon 

I’ll use some terms common both to programming and data science that you may not 
be familiar with. Thus, here are some brief definitions: 

Munge/munging/ wrangling 

Describes the overall process of manipulating unstructured and/or messy data 
into a structured or clean form. The word has snuck its way into the jargon of 
many modern-day data hackers. “Munge” rhymes with “grunge.” 

Pseudocode 

A description of an algorithm or process that takes a code-like form while likely 
not being actual valid source code. 

Syntactic sugar 

Programming syntax that does not add new features, but makes something more 
convenient or easier to type. 
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CHAPTER 2 


Python Language Basics, IPython, and 

Jupyter Notebooks 


When I wrote the first edition of this book in 2011 and 2012, there were fewer resour¬ 
ces available for learning about doing data analysis in Python. This was partially a 
chicken-and-egg problem; many libraries that we now take for granted, like pandas, 
scikit-learn, and statsmodels, were comparatively immature back then. In 2017, there 
is now a growing literature on data science, data analysis, and machine learning, sup¬ 
plementing the prior works on general-purpose scientific computing geared toward 
computational scientists, physicists, and professionals in other research fields. There 
are also excellent books about learning the Python programming language itself and 
becoming an effective software engineer. 

As this book is intended as an introductory text in working with data in Python, I feel 
it is valuable to have a self-contained overview of some of the most important fea¬ 
tures of Python’s built-in data structures and libraries from the perspective of data 
manipulation. So, I will only present roughly enough information in this chapter and 
Chapter 3 to enable you to follow along with the rest of the book. 

In my opinion, it is not necessary to become proficient at building good software in 
Python to be able to productively do data analysis. I encourage you to use the IPy¬ 
thon shell and Jupyter notebooks to experiment with the code examples and to 
explore the documentation for the various types, functions, and methods. While I’ve 
made best efforts to present the book material in an incremental form, you may occa¬ 
sionally encounter things that have not yet been fully introduced. 

Much of this book focuses on table-based analytics and data preparation tools for 
working with large datasets. In order to use those tools you must often first do some 
munging to corral messy data into a more nicely tabular (or structured) form. Fortu¬ 
nately, Python is an ideal language for rapidly whipping your data into shape. The 
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greater your facility with Python the language, the easier it will be for you to prepare 
new datasets for analysis. 

Some of the tools in this book are best explored from a live IPython or Jupyter ses¬ 
sion. Once you learn how to start up IPython and Jupyter, I recommend that you fol¬ 
low along with the examples so you can experiment and try different things. As with 
any keyboard-driven console-like environment, developing muscle-memory for the 
common commands is also part of the learning curve. 



There are introductory Python concepts that this chapter does not 
cover, like classes and object-oriented programming, which you 
may find useful in your foray into data analysis in Python. 

To deepen your Python language knowledge, I recommend that 
you supplement this chapter with the official Python tutorial and 
potentially one of the many excellent books on general-purpose 
Python programming. Some recommendations to get you started 
include: 


• Python Cookbook , Third Edition, by David Beazley and Brian 
K. Jones (O’Reilly) 

• Fluent Python by Luciano Ramalho (O’Reilly) 

• Effective Python by Brett Slatkin (Pearson) 


2.1 The Python Interpreter 

Python is an interpreted language. The Python interpreter runs a program by execut¬ 
ing one statement at a time. The standard interactive Python interpreter can be 
invoked on the command line with the python command: 

$ python 

Python 3.6.0 | packaged by conda-forge | (default, Jan 13 2017, 23:17:12) 

[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux 

Type "help", "copyright", "credits" or "license" for more information. 

»> a = 5 
»> print(a) 

5 

The »> you see is the prompt where you’ll type code expressions. To exit the Python 
interpreter and return to the command prompt, you can either type exit() or press 
Ctrl-D. 

Running Python programs is as simple as calling python with a .py file as its first 
argument. Suppose we had created hello_world.py with these contents: 

print( ' Hello world') 
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You can run it by executing the following command (the hello_world.py file must be 
in your current working terminal directory): 

$ python hetlo_world.py 
Helto world 

While some Python programmers execute all of their Python code in this way, those 
doing data analysis or scientific computing make use of IPython, an enhanced Python 
interpreter, or Jupyter notebooks, web-based code notebooks originally created 
within the IPython project. I give an introduction to using IPython and Jupyter in 
this chapter and have included a deeper look at IPython functionality in Appendix A. 
When you use the %run command, IPython executes the code in the specified file in 
the same process, enabling you to explore the results interactively when it’s done: 

$ ipython 

Python 3.6.0 | packaged by conda-forge | (default, Jan 13 2017, 23:17:12) 

Type "copyright", "credits" or "license" for more information. 

IPython 5.1.0 -- An enhanced Interactive Python. 

? -> Introduction and overview of IPython' s features. 

%quickref -> Quick reference. 

help -> Python's own help system. 

object? -> Details about 'object', use 'object??' for extra details. 

In [1]: %run hello_world.py 
Hello world 

In [2]: 

The default IPython prompt adopts the numbered In [2]: style compared with the 
standard »> prompt. 

2.2 IPython Basics 

In this section, we’ll get you up and running with the IPython shell and Jupyter note¬ 
book, and introduce you to some of the essential concepts. 

Running the IPython Shell 

You can launch the IPython shell on the command line just like launching the regular 
Python interpreter except with the ipython command: 

$ ipython 

Python 3.6.0 | packaged by conda-forge | (default, Jan 13 2017, 23:17:12) 

Type "copyright", "credits" or "license" for more information. 

IPython 5.1.0 -- An enhanced Interactive Python. 

? -> Introduction and overview of IPython's features. 

%quickref -> Quick reference. 

help -> Python's own help system. 
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object? -> Details about 'object', use 'object??' for extra details. 


In [1]: a = 5 

In [2]: a 
Out[2]: 5 

You can execute arbitrary Python statements by typing them in and pressing Return 
(or Enter). When you type just a variable into IPython, it renders a string representa¬ 
tion of the object: 

In [5]: inport as 

In [6]: data = {i : np. random. randnQ for i in range(7)} 

In [7]: data 
0ut[7] : 

{0: -0.20470765948471295, 

1: 0.47894333805754824, 

2: -0.5194387150567381, 

3: -0.55573030434749, 

4: 1.9657805725027142, 

5: 1.3934058329729904, 

6: 0.09290787674371767} 

The first two lines are Python code statements; the second statement creates a vari¬ 
able named data that refers to a newly created Python dictionary. The last line prints 
the value of data in the console. 

Many kinds of Python objects are formatted to be more readable, or pretty-printed, 
which is distinct from normal printing with print. If you printed the above data 
variable in the standard Python interpreter, it would be much less readable: 

»> from inport randn 

»> data = {i : randn() for i in range(7)} 

»> print(data) 

{0: -1.5948255432744511, 1: 0.10569006472787983, 2: 1.972367135977295, 

3: 0.15455217573074576, 4: -0.24058577449429575, 5: -1.2904897053651216, 

6: 0.3308507317325902} 

IPython also provides facilities to execute arbitrary blocks of code (via a somewhat 
glorified copy-and-paste approach) and whole Python scripts. You can also use the 
Jupyter notebook to work with larger blocks of code, as we’ll soon see. 

Running the Jupyter Notebook 

One of the major components of the Jupyter project is the notebook, a type of interac¬ 
tive document for code, text (with or without markup), data visualizations, and other 
output. The Jupyter notebook interacts with kernels, which are implementations of 
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the Jupyter interactive computing protocol in any number of programming lan¬ 
guages. Python’s Jupyter kernel uses the IPython system for its underlying behavior. 

To start up Jupyter, run the command jupyter notebook in a terminal: 

$ jupyter notebook 

[I 15:20:52.739 NotebookApp] Serving notebooks from local directory: 

/home/wesm/code/pydata-book 

[I 15:20:52.739 NotebookApp] 0 active kernels 

[I 15:20:52.739 NotebookApp] The Jupyter Notebook is running at: 

http://localhost:8888/ 

[I 15:20:52.740 NotebookApp] Use Control-C to stop this server and shut down 
all kernels (twice to skip confirmation). 

Created new window in existing browser session. 

On many platforms, Jupyter will automatically open up in your default web browser 
(unless you start it with - -no-browser). Otherwise, you can navigate to the HTTP 
address printed when you started the notebook, here http ://localhost: 8888/. See 
Figure 2-1 for what this looks like in Google Chrome. 



Many people use Jupyter as a local computing environment, but it 
can also be deployed on servers and accessed remotely. I won’t 
cover those details here, but encourage you to explore this topic on 
the internet if it’s relevant to your needs. 


"jupyter 


Files Running Clusters 
Select items to perform actions on them. 
□ - * 

□ C) ch02 

□ Cl ch03 

□ Cl ch06 
S Cl ch07 
@ D ch08 
B D ch09 
0 D chll 
0 Cl chl3 

0 appendix_python.ipynb 
O B ch02.ipynb 
U 0 ch04.ipynb 
O 0 ch05.ipynb 
O 0 ch06.ipynb 

□ 0 ch07.ipynb 
D 0 ch08.ipynb 


Figure 2-1. Jupyter notebook landing page 
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To create a new notebook, click the New button and select the “Python 3” or “conda 
[default]” option. You should see something like Figure 2-2. If this is your first time, 
try clicking on the empty code “cell” and entering a line of Python code. Then press 
Shift-Enter to execute it. 



Figure 2-2. Jupyter new notebook view 

When you save the notebook (see “Save and Checkpoint” under the notebook File 
menu), it creates a file with the extension .ipynb. This is a self-contained file format 
that contains all of the content (including any evaluated code output) currently in the 
notebook These can be loaded and edited by other Jupyter users. To load an existing 
notebook, put the file in the same directory where you started the notebook process 
(or in a subfolder within it), then double-click the name from the landing page. You 
can try it out with the notebooks from my wesm/pydata-book repository on GitHub. 
See Figure 2-3. 

While the Jupyter notebook can feel like a distinct experience from the IPython shell, 
nearly all of the commands and tools in this chapter can be used in either environ¬ 
ment. 
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^jupyter Ch02 (unsaved changes) 

File Edit View Insert Cell Kernel Help | Python 3 O 

Ei + 3C | Q) Q * * M ■ C Code * m CellToolbar 

Introductory examples 

l.usa.gov data from bit.ly H 

In [ ]: °spwd 

In [ ]: path = , ch02/usagov_bitly_data2012-03-16-1331923249.txt' 

In [ ]: open(path).readline() 

In [ ]: import json 

path = 1 ch02Zusagov_bitly_data2012-03-16-1331923249.txt' 
records = [json.loads(line) for line in open(path)] 

In [ ]: records[0] 

In [ ]: records[0][' tz' ] 

In [ ]: print (records[0][ 1 tz ']) 

Counting time zones in pure Python 

Figure 2-3. Jupyter example view for an existing notebook 

Tab Completion 

On the surface, the IPython shell looks like a cosmetically different version of the 
standard terminal Python interpreter (invoked with python). One of the major 
improvements over the standard Python shell is tab completion, found in many IDEs 
or other interactive computing analysis environments. While entering expressions in 
the shell, pressing the Tab key will search the namespace for any variables (objects, 
functions, etc.) matching the characters you have typed so far: 

In [1]: an_appte = 27 
In [2]: an_example = 42 
In [3]: an<Tat» 

an_apple and an_example any 

In this example, note that IPython displayed both the two variables I defined as well 
as the Python keyword and and built-in function any. Naturally, you can also com¬ 
plete methods and attributes on any object after typing a period: 
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In [3]: b = [1, 2, 3] 


In [4]: b.<Tab> 

b.append b.count b.insert b.reverse 
b.clear b.extend b.pop b.sort 
b.copy b.index b.remove 

The same goes for modules: 

In [1]: import datetime 

In [2]: datetime.<Tab> 

datetime.date datetime.MAXYEAR datetime.timedelta 

datetime.datetime datetime.MINYEAR datetime.timezone 

datetime.datetime_CAPI datetime.time datetime.tzinfo 

In the Jupyter notebook and newer versions of IPython (5.0 and higher), the auto¬ 
completions show up in a drop-down box rather than as text output. 



Note that IPython by default hides methods and attributes starting 
with underscores, such as magic methods and internal “private” 
methods and attributes, in order to avoid cluttering the display 
(and confusing novice users!). These, too, can be tab-completed, 
but you must first type an underscore to see them. If you prefer to 
always see such methods in tab completion, you can change this 
setting in the IPython configuration. See the IPython documenta¬ 
tion to find out how to do this. 


Tab completion works in many contexts outside of searching the interactive name- 
space and completing object or module attributes. When typing anything that looks 
like a file path (even in a Python string), pressing the Tab key will complete anything 
on your computer s filesystem matching what you’ve typed: 

In [7]: datasets/movlelens/<Tab> 

datasets/movielens/movies.dat datasets/movielens/README 
datasets/movielens/ratings.dat datasets/movielens/users.dat 

In [7]: path = 'datasets/movielens/<Tab> 
datasets/movielens/movies.dat datasets/movielens/README 
datasets/movielens/ratings.dat datasets/movielens/users.dat 

Combined with the %run command (see “The %run Command” on page 25), this 
functionality can save you many keystrokes. 

Another area where tab completion saves time is in the completion of function key¬ 
word arguments (and including the = sign!). See Figure 2-4. 


22 | Chapter 2: Python Language Basics, IPython, and Jupyter Notebooks 






In [12]: def func with keywords(abra=l, abbra=2, abbbra=3): 
return abra, abbra, abbbra 


In [ ]: func with keywords ab| 

abbbra= 

abbra= 

abra= 

abs 


Figure 2-4. Auto complete function keywords in Jupyter notebook 
We’ll have a closer look at functions in a little bit. 

Introspection 

Using a question mark (?) before or after a variable will display some general infor¬ 
mation about the object: 

In [8]: b = [1, 2, 3] 

In [9]: b? 

Type: list 

String Form:[l, 2, 3] 

Length: 3 

Docstring: 

list() -> new empty list 

list(iterable) -> new list initialized from 's items 

In [10]: print? 

Docstring: 

print(value, . .., sep=' ', end='\n', file=sys . stdout, flush=False) 

Prints the values to a stream, or to sys.stdout by default. 

Optional keyword arguments: 

file: a file-like object (stream); defaults to the current sys.stdout. 
sep: string inserted between values, default a space, 

end: string appended after the last value, default a newline, 

flush: whether to forcibly flush the stream. 

Type: builtin_function_or_method 

This is referred to as object introspection. If the object is a function or instance 
method, the docstring, if defined, will also be shown. Suppose we’d written the follow¬ 
ing function (which you can reproduce in IPython or Jupyter): 
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def add_numbers(a, b): 

Add two numbers together 
Returns 


the_sum : type of arguments 

return a + b 

Then using ? shows us the docstring: 

In [11]: add_numbers? 

Signature: add_numbers(a, b) 
Docstring: 

Add two numbers together 
Returns 


the_sum : type of arguments 

File: <ipython-input-9-6aS48a216e27> 

Type: function 

Using ?? will also show the function’s source code if possible: 

In [12]: add_numbers?? 

Signature: add_numbers(a, b) 

Source: 

def add_numbers(a, b): 

Add two numbers together 
Returns 


the_sum : type of arguments 

return a + b 

File: <ipython-input-9-6aS48a216e27> 

Type: function 

? has a final usage, which is for searching the IPython namespace in a manner similar 
to the standard Unix or Windows command line. A number of characters combined 
with the wildcard (*) will show all names matching the wildcard expression. For 
example, we could get a list of all functions in the top-level NumPy namespace con¬ 
taining load: 

In [13]: np.*load*? 

np._loader_ 

np.load 
np.loads 
np.loadtxt 
np.pkgload 
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The %run Command 

You can run any file as a Python program inside the environment of your IPython 
session using the %run command. Suppose you had the following simple script stored 
in ipython_script_test.py: 

def f(x, y, z): 

return (x + y) / z 

a = 5 
b = 6 
c = 7.5 

result = f(a, b, c) 

You can execute this by passing the filename to %run: 

In [14]: %run ipython_script_test.py 

The script is run in an empty namespace (with no imports or other variables defined) 
so that the behavior should be identical to running the program on the command line 
using python script.py. All of the variables (imports, functions, and globals) 
defined in the file (up until an exception, if any, is raised) will then be accessible in 
the IPython shell: 

In [15]: c 
Out [15]: 7.5 

In [16]: result 

0ut[16] : 1.4666666666666666 

If a Python script expects command-line arguments (to be found in sys. argv), these 
can be passed after the file path as though run on the command line. 



Should you wish to give a script access to variables already defined 
in the interactive IPython namespace, use %run -i instead of plain 
%run. 


In the Jupyter notebook, you may also use the related %load magic function, which 
imports a script into a code cell: 

»> %load lpython_scrlpt_test.py 

def f(x, y, z): 

return (x + y) / z 


a = 5 
b = 6 
c = 7.5 
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result = f(a, b, c) 


Interrupting running code 

Pressing Ctrl-C while any code is running, whether a script through %run or a long- 
running command, will cause a Keyboardlnterrupt to be raised. This will cause 
nearly all Python programs to stop immediately except in certain unusual cases. 



When a piece of Python code has called into some compiled exten¬ 
sion modules, pressing Ctrl-C will not always cause the program 
execution to stop immediately. In such cases, you will have to 
either wait until control is returned to the Python interpreter, or in 
more dire circumstances, forcibly terminate the Python process. 


Executing Code from the Clipboard 

If you are using the Jupyter notebook, you can copy and paste code into any code cell 
and execute it. It is also possible to run code from the clipboard in the IPython shell. 
Suppose you had the following code in some other application: 

x = 5 

y = 7 

if x > 5: 
x += 1 

y = 8 

The most foolproof methods are the %paste and %cpaste magic functions. %paste 
takes whatever text is in the clipboard and executes it as a single block in the shell: 

In [17]: %paste 
x = 5 

y = 7 

if x > 5: 
x += 1 

y - 8 

## -- End pasted text -- 

%cpaste is similar, except that it gives you a special prompt for pasting code into: 

In [18]: %cpaste 

Pasting code; enter alone on the line to stop or use Ctrl-D. 

:x = 5 

:y = 7 

:if x > 5: 

: x += 1 
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y = 8 


With the %cpaste block, you have the freedom to paste as much code as you like 
before executing it. You might decide to use %cpaste in order to look at the pasted 
code before executing it. If you accidentally paste the wrong code, you can break out 
of the %cpaste prompt by pressing Ctrl-C. 

Terminal Keyboard Shortcuts 

IPython has many keyboard shortcuts for navigating the prompt (which will be famil¬ 
iar to users of the Emacs text editor or the Unix bash shell) and interacting with the 
shells command history. Table 2-1 summarizes some of the most commonly used 
shortcuts. See Figure 2-5 for an illustration of a few of these, such as cursor 
movement. 


C-b C-f 



◄-► 



In [27]: a variable 

In [ 27 ]: a vari 

C-k 

t t 

C-a C-e 

In [ 27 ]: 

C-u 


Figure 2-5. Illustration of some keyboard shortcuts in the IPython shell 


Table 2-1. Standard IPython keyboard shortcuts 


I Keyboard shortcut 

Description 1 

Ctrl-P or up-arrow 

Search backward in command history for commands starting with currently entered text 

Ctrl-N or down-arrow 

Search forward in command history for commands starting with currently entered text 

Ctrl-R 

Readline-style reverse history search (partial matching) 

Ctrl-Shift-V 

Paste text from clipboard 

Ctrl-C 

Interrupt currently executing code 

Ctrl-A 

Move cursor to beginning of line 

Ctrl-E 

Move cursor to end of line 

Ctrl-K 

Delete text from cursor until end of line 

Ctrl-U 

Discard all text on current line 

Ctrl-F 

Move cursor forward one character 

Ctrl-B 

Move cursor back one character 

Ctrl-L 

Clear screen 


Note that Jupyter notebooks have a largely separate set of keyboard shortcuts for nav¬ 
igation and editing. Since these shortcuts have evolved more rapidly than IPython’s, I 
encourage you to explore the integrated help system in the Jupyter notebooks menus. 
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About Magic Commands 

IPython’s special commands (which are not built into Python itself) are known as 
“magic” commands. These are designed to facilitate common tasks and enable you to 
easily control the behavior of the IPython system. A magic command is any com¬ 
mand prefixed by the percent symbol %. For example, you can check the execution 
time of any Python statement, such as a matrix multiplication, using the 
magic function (which will be discussed in more detail later): 

In [20]: a = np.random.randn(100, 100) 

In [20]: %timeit np.dot(a, a) 

10000 loops, best of 3: 20.9 ps per loop 

Magic commands can be viewed as command-line programs to be run within the 
IPython system. Many of them have additional “command-line” options, which can 
all be viewed (as you might expect) using ?: 

In [21]: %debug? 

Docstring : 


%debug [--breakpoint FILE:LINE] [statement [statement ...]] 

Activate the interactive debugger. 

This magic command support two ways of activating debugger. 

One is to activate debugger before executing code. This way, you 
can set a break point, to step through the code from 
You can use this mode by giving statements to execute and optionally 
a breakpoint. 

The other one is to activate debugger in post-mortem mode. You can 
activate this mode simply running %debug without any argument. 

If an exception has just occurred, this lets you inspect its stack 
frames interactively. Note that this will always work only on the last 
traceback that occurred, so you must call this quickly after an 
exception that you wish to inspect has fired, because if another one 
occurs, it clobbers the previous one. 

If you want IPython to automatically do this on every exception, see 
the %pdb magic for more details. 

positional arguments: 

statement Code to run in debugger. You can omit this in cell 

magic mode. 


optional arguments: 

--breakpoint <FILE:LINE>, -b <FILE:LINE> 

Set break point at LINE in FILE. 
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Magic functions can be used by default without the percent sign, as long as no vari¬ 
able is defined with the same name as the magic function in question. This feature is 
called automagic and can be enabled or disabled with %automagic. 

Some magic functions behave like Python functions and their output can be assigned 
to a variable: 

In [22]: %pwd 

0ut[22] : '/home/wesm/code/pydata-book 
In [23]: foo = %pwd 
In [24]: foo 

0ut[24] : '/home/wesm/code/pydata-book' 

Since IPython’s documentation is accessible from within the system, I encourage you 
to explore all of the special commands available by typing %quickref or %maglc. 
Table 2-2 highlights some of the most critical ones for being productive in interactive 
computing and Python development in IPython. 


Table 2-2. Some frequently used IPython magic commands 


Command 


Description 


%quickref 

%magtc 

%debug 

%hlst 

%pdb 

%paste 

%cpaste 

%reset 

%page OBJECT 

%run script.py 

%prun statement 

%time statement 

%timeit statement 

%who, %who_ls, %whos 

%xdet variable 


Display the IPython Quick Reference Card 

Display detailed documentation for all of the available magic commands 

Enter the interactive debugger at the bottom of the last exception traceback 

Print command input (and optionally output) history 

Automatically enter debugger after any exception 

Execute preformatted Python code from clipboard 

Open a special prompt for manually pasting Python code to be executed 

Delete all variables/names defined in interactive namespace 

Pretty-print the object and display it through a pager 

Run a Python script inside IPython 

Execute statement with cProftte and report the profiler output 
Report the execution time of a single statement 

Run a statement multiple times to compute an ensemble average execution time; useful for 
timing code with very short execution time 

Display variables defined in interactive namespace, with varying levels of information/ 
verbosity 

Delete a variable and attempt to clear any references to the object in the IPython internals 


Matplotlib Integration 

One reason for IPython’s popularity in analytical computing is that it integrates well 
with data visualization and other user interface libraries like matplotlib. Don’t worry 
if you have never used matplotlib before; it will be discussed in more detail later in 
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this book. The %rnatplotlib magic function configures its integration with the IPy- 
thon shell or Jupyter notebook. This is important, as otherwise plots you create will 
either not appear (notebook) or take control of the session until closed (shell). 

In the IPython shell, running %matplotlib sets up the integration so you can create 
multiple plot windows without interfering with the console session: 

In [26]: %matplotlib 

Using matplotlib backend: Qt4Agg 

In Jupyter, the command is a little different (Figure 2-6): 

In [26]: %matptotIib Inline 


In [14]: -smatplotlib inline 

In [15]: import matplotlib.pyplot as pit 

pit.plot(np.random.randn(50).cumsum ) 

Out[15]: [cmatplotlib. lines. Line2D at 0x7f828f0497f0>] 



Tn r 1 • I III 

Figure 2-6. Jupyter inline matplotlib plotting 

2.3 Python Language Basics 

In this section, I will give you an overview of essential Python programming concepts 
and language mechanics. In the next chapter, I will go into more detail about Pythons 
data structures, functions, and other built-in tools. 

Language Semantics 

The Python language design is distinguished by its emphasis on readability, simplic¬ 
ity, and explicitness. Some people go so far as to liken it to “executable pseudocode.” 

Indentation, not braces 

Python uses whitespace (tabs or spaces) to structure code instead of using braces as in 
many other languages like R, C++, Java, and Perl. Consider a for loop from a sorting 
algorithm: 
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for x in array: 
if x < pivot: 

less.append(x) 
else: 

greater.append(x) 

A colon denotes the start of an indented code block after which all of the code must 
be indented by the same amount until the end of the block. 

Love it or hate it, significant whitespace is a fact of life for Python programmers, and 
in my experience it can make Python code more readable than other languages I’ve 
used. While it may seem foreign at first, you will hopefully grow accustomed in time. 



I strongly recommend using four spaces as your default indentation 
and replacing tabs with four spaces. Many text editors have a set¬ 
ting that will replace tab stops with spaces automatically (do this!). 
Some people use tabs or a different number of spaces, with two 
spaces not being terribly uncommon. By and large, four spaces is 
the standard adopted by the vast majority of Python programmers, 
so I recommend doing that in the absence of a compelling reason 
otherwise. 


As you can see by now, Python statements also do not need to be terminated by semi¬ 
colons. Semicolons can be used, however, to separate multiple statements on a single 
line: 


a=5;b=6;c=7 

Putting multiple statements on one line is generally discouraged in Python as it often 
makes code less readable. 

Everything is an object 

An important characteristic of the Python language is the consistency of its object 
model. Every number, string, data structure, function, class, module, and so on exists 
in the Python interpreter in its own “box,” which is referred to as a Python object. 
Each object has an associated type (e.g., string or function) and internal data. In prac¬ 
tice this makes the language very flexible, as even functions can be treated like any 
other object. 

Comments 

Any text preceded by the hash mark (pound sign) # is ignored by the Python inter¬ 
preter. This is often used to add comments to code. At times you may also want to 
exclude certain blocks of code without deleting them. An easy solution is to comment 
out the code: 
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results = [] 

for line in filehandle: 

# keep the empty lines for now 

# if len(line) == 0: 

# continue 

results.append(line.replace( 1 foo' , ' bar' )) 

Comments can also occur after a line of executed code. While some programmers 
prefer comments to be placed in the line preceding a particular line of code, this can 
be useful at times: 

print( "Reached this line") # Simple status report 

Function and object method calls 

You call functions using parentheses and passing zero or more arguments, optionally 
assigning the returned value to a variable: 

result = f(x, y, z) 

9 () 

Almost every object in Python has attached functions, known as methods, that have 
access to the object’s internal contents. You can call them using the following syntax: 

obj.some_method(x, y, z) 

Functions can take both positional and keyword arguments: 

result = f(a, b, c, d=5, e='foo') 

More on this later. 

Variables and argument passing 

When assigning a variable (or name) in Python, you are creating a reference to the 
object on the righthand side of the equals sign. In practical terms, consider a list of 
integers: 

In [8]: a = [1, 2, 3] 

Suppose we assign a to a new variable b: 

In [9]: b = a 

In some languages, this assignment would cause the data [ 1, 2, 3 ] to be copied. In 
Python, a and b actually now refer to the same object, the original list [1, 2, 3] (see 
Figure 2-7 for a mockup). You can prove this to yourself by appending an element to 
a and then examining b: 

In [10]: a.append(4) 

In [11]: b 

Out[ll] : [1, 2, 3, 4] 
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a-► 

b-► 

list 

[1, 2, 3] 





Figure 2-7. Two references for the same object 


Understanding the semantics of references in Python and when, how, and why data is 
copied is especially critical when you are working with larger datasets in Python. 



Assignment is also referred to as binding , as we are binding a name 
to an object. Variable names that have been assigned may occasion¬ 
ally be referred to as bound variables. 


When you pass objects as arguments to a function, new local variables are created ref¬ 
erencing the original objects without any copying. If you bind a new object to a vari¬ 
able inside a function, that change will not be reflected in the parent scope. It is 
therefore possible to alter the internals of a mutable argument. Suppose we had the 
following function: 

def append_element(some_list, etement): 
some_ltst .append(element) 

Then we have: 

In [27]: data = [1, 2, 3] 

In [28]: append_etement(data, 4) 

In [29]: data 
0ut[29] : [1, 2, 3, 4] 

Dynamic references, strong types 

In contrast with many compiled languages, such as Java and C++, object references in 
Python have no type associated with them. There is no problem with the following: 

In [12]: a = 5 

In [13]: type(a) 

0ut[13]: int 

In [14]: a = 'foo' 
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In [15]: type(a) 

Out[15] : str 

Variables are names for objects within a particular namespace; the type information is 
stored in the object itself. Some observers might hastily conclude that Python is not a 
“typed language.” This is not true; consider this example: 

In [16]: '5' + 5 


TypeError Traceback (most recent call last) 

<lpython-lnput-16-f9dbf5f0b234> in <module>() 

-> 1 '5' + 5 

TypeError: must be str, not int 

In some languages, such as Visual Basic, the string ' 5' might get implicitly converted 
(or casted) to an integer, thus yielding 10. Yet in other languages, such as JavaScript, 
the integer 5 might be casted to a string, yielding the concatenated string ' 55'. In this 
regard Python is considered a strongly typed language, which means that every object 
has a specific type (or class), and implicit conversions will occur only in certain obvi¬ 
ous circumstances, such as the following: 

In [17]: a = 4.5 
In [18]: b = 2 

# String fornatting, to be visited later 

In [19]: print('a is {0}, b is {1}' .format(type(a) , type(b))) 
a is <class ' float '>, b is <class ' int '> 

In [20]: a / b 
Out[20] : 2.25 

Knowing the type of an object is important, and it’s useful to be able to write func¬ 
tions that can handle many different kinds of input. You can check that an object is an 
instance of a particular type using the isinstance function: 

In [21]: a = 5 

In [22]: isinstance(a, int) 

0ut[22]: True 

isinstance can accept a tuple of types if you want to check that an object’s type is 
among those present in the tuple: 

In [23]: a = 5; b = 4.5 

In [24]: isinstance(a, (int, float)) 

0ut[24]: True 

In [25]: isinstance(b, (int, float)) 

0ut[25]: True 
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Attributes and methods 

Objects in Python typically have both attributes (other Python objects stored “inside” 
the object) and methods (functions associated with an object that can have access to 
the objects internal data). Both of them are accessed via the syntax 
obj. attribute_nane: 

In [1]: a = 'foo' 

In [2]: a.<Press Tab> 


a. capitalize 

a.format 

a . isupper 

a.rindex 

a.strip 

a.center 

a.index 

a.join 

a.rjust 

a.swapcase 

a.count 

a.isalnum 

a.ljust 

a.rpartition 

a.title 

a.decode 

a.isalpha 

a.lower 

a.rsplit 

a.translate 

a.encode 

a.isdigit 

a.Istrip 

a.rstrip 

a.upper 

a.endswith 

a.islower 

a.partition 

a.split 

a.zfill 

a.expandtabs 

a.isspace 

a.replace 

a.splitlines 


a.find 

a.istitle 

a . rfind 

a.startswith 



Attributes and methods can also be accessed by name via the getattr function: 

In [27]: getattr(a, 'split') 

0ut[27]: <function str.split> 

In other languages, accessing objects by name is often referred to as “reflection.” 
While we will not extensively use the functions getattr and related functions 
hasattr and setattr in this book, they can be used very effectively to write generic, 
reusable code. 

Duck typing 

Often you may not care about the type of an object but rather only whether it has 
certain methods or behavior. This is sometimes called “duck typing,” after the saying 
“If it walks like a duck and quacks like a duck, then it’s a duck.” For example, you can 
verify that an object is iterable if it implemented the iterator protocol. For many 
objects, this means it has a_ iter _“magic method,” though an alternative and bet¬ 

ter way to check is to try using the iter function: 

def isiterable(obj ): 

try: 

iter(obj) 
return True 

except TypeError: # not iterable 
return False 

This function would return True for strings as well as most Python collection types: 

In [29]: islterable(' a string') 

0ut[29]: True 

In [30]: isiterable([l, 2, 3]) 
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Out [ 30 ]: True 


In [31]: lstterable(5) 

Out[31]: False 

A place where I use this functionality all the time is to write functions that can accept 
multiple kinds of input. A common case is writing a function that can accept any 
kind of sequence (list, tuple, ndarray) or even an iterator. You can first check if the 
object is a list (or a NumPy array) and, if it is not, convert it to be one: 

if not tsinstance(x, list) and isiterable(x) : 
x = list(x) 


Imports 

In Python a module is simply a file with the .py extension containing Python code. 
Suppose that we had the following module: 

# some_module.py 
PI = 3.14159 

def f(x): 

return x + 2 

def g(a, b) : 

return a + b 

If we wanted to access the variables and functions defined in some_module.py, from 
another file in the same directory we could do: 

import some_module 

result = some_module.f(5) 
pi = some_module.PI 

Or equivalently: 

from import f, g, PI 

result = g(5, PI) 

By using the as keyword you can give imports different variable names: 

import some_module as sm 

from import PI as pi, g as gf 

rl = sm.f(pi) 
r2 = gf(6, pi) 

Binary operators and comparisons 

Most of the binary math operations and comparisons are as you might expect: 

In [32]: 5 - 7 
0ut[32] : -2 
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In [33]: 12 + 21.5 
Out[33] : 33.5 

In [34]: 5 <= 2 
0ut[34]: False 

See Table 2-3 for all of the available binary operators. 

To check if two references refer to the same object, use the is keyword, is not is also 
perfectly valid if you want to check that two objects are not the same: 

In [35]: a = [1, 2, 3] 

In [36]: b = a 

In [37]: c = list(a) 

In [38]: a is b 
0ut[38]: True 

In [39]: a is not c 
0ut[39]: True 

Since list always creates a new Python list (i.e., a copy), we can be sure that c is dis¬ 
tinct from a. Comparing with is is not the same as the == operator, because in this 
case we have: 

In [40]: a == c 
Out[40]: True 

A very common use of is and is not is to check if a variable is None, since there is 
only one instance of None: 

In [41]: a = None 

In [42]: a is None 
0ut[42]: True 

Table 2-3. Binary operators 


Operation Description 


a + b Add a and b 

a - b Subtract b from a 

a * b Multiply a by b 

a / b Divide a by b 

a // b Floor-divide a by b, dropping any fractional remainder 

a ** b Raise a to the b power 

a & b True if both a and b are True; for integers, take the bitwise AND 

a | b True if either a or bis True; for integers, take the bitwise OR 

a A b For booleans, True if a or b is True, but not both; for integers, take the bitwise EXCLUSIVE-OR 
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Operation Description 


a == b True if a equals b 

a != b True if a is not equal to b 

a <= b, a < b True if a is less than (less than or equal) to b 

a > b, a >= b True if a is greater than (greater than or equal) to b 

a is b True if a and b reference the same Python object 

a is not b True if a and b reference different Python objects 


Mutable and immutable objects 

Most objects in Python, such as lists, diets, NumPy arrays, and most user-defined 
types (classes), are mutable. This means that the object or values that they contain can 
be modified: 

In [43]: a_list = ['foo', 2, [4, 5]] 

In [44]: a_list[2] = (3, 4) 

In [45]: a_tist 

0ut[45]: ['foo', 2, (3, 4)] 

Others, like strings and tuples, are immutable: 

In [46]: a_tuple = (3, 5, (4, 5)) 

In [47]: a_tuple[l] = 'four' 

TypeError Traceback (most recent call last) 

<lpython-input-47-b7966a9ae0fl> in <module>() 

-> 1 a_tuple[l] = 'four' 

TypeError: ’tuple’ object does not support item assignment 

Remember that just because you can mutate an object does not mean that you always 
should. Such actions are known as side effects. For example, when writing a function, 
any side effects should be explicitly communicated to the user in the functions docu¬ 
mentation or comments. If possible, I recommend trying to avoid side effects and 
favor immutability, even though there may be mutable objects involved. 

Scalar Types 

Python along with its standard library has a small set of built-in types for handling 
numerical data, strings, boolean (True or False) values, and dates and time. These 
“single value” types are sometimes called scalar types and we refer to them in this 
book as scalars. See Table 2-4 for a list of the main scalar types. Date and time han¬ 
dling will be discussed separately, as these are provided by the datetime module in 
the standard library. 
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Table 2-4. Standard Python scalar types 


Type Description 


None The Python "null" value (only one instance of the None object exists) 

str String type; holds Unicode (UTF-8 encoded) strings 

bytes Raw ASCII bytes (or Unicode encoded as bytes) 

float Double-precision (64-bit) floating-point number (note there is no separate double type) 

bool A True or False value 

int Arbitrary precision signed integer 


Numeric types 

The primary Python types for numbers are int and float. An int can store arbitrar¬ 
ily large numbers: 

In [48]: ival = 17239871 
In [49]: ival ** 6 

0ut[49] : 26254519291092456596965462913230729701102721 

Floating-point numbers are represented with the Python float type. Under the hood 
each one is a double-precision (64-bit) value. They can also be expressed with scien¬ 
tific notation: 

In [50]: fval = 7.243 
In [51]: fval2 = 6.78e-5 

Integer division not resulting in a whole number will always yield a floating-point 
number: 

In [52]: 3/2 
0ut[52] : 1.5 

To get C-style integer division (which drops the fractional part if the result is not a 
whole number), use the floor division operator //: 

In [53]: 3 // 2 
0ut[53] : 1 

Strings 

Many people use Python for its powerful and flexible built-in string processing capa¬ 
bilities. You can write string literals using either single quotes ' or double quotes ": 

a = 'one way of writing a string' 
b = "another way" 

For multiline strings with line breaks, you can use triple quotes, either 11 ' or """: 
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This is a longer string that 
spans multiple lines 


It may surprise you that this string c actually contains four lines of text; the line 
breaks after """ and after lines are included in the string. We can count the new line 
characters with the count method on c: 

In [55]: c.count( '\n' ) 

0ut[55]: 3 

Python strings are immutable; you cannot modify a string: 

In [56]: a = 'this is a string' 

In [57]: a[10] = 'f' 


TypeError Traceback (most recent call last) 

<ipython-input-57-5ca625dle504> in <module>() 

-> 1 a[10] = ' f' 

TypeError: 'str' object does not support item assignment 
In [58]: b = a.replace( 'string' , 'longer string 1 ) 

In [59]: b 

0ut[59]: 'this is a longer string' 

Afer this operation, the variable a is unmodified: 

In [60]: a 

Out[60]: 'this is a string' 

Many Python objects can be converted to a string using the str function: 

In [61]: a = 5.6 

In [62]: s = str(a) 

In [63]: print(s) 

5.6 

Strings are a sequence of Unicode characters and therefore can be treated like other 
sequences, such as lists and tuples (which we will explore in more detail in the next 
chapter): 

In [64]: s = 'python' 

In [65]: list(s) 

0ut[65] : [V, 'y', 't', 'h\ 'o', ' n' ] 

In [66]: s[:3] 

0ut[66]: 'pyt' 
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The syntax s[:3] is called slicing and is implemented for many kinds of Python 
sequences. This will be explained in more detail later on, as it is used extensively in 
this book 

The backslash character \ is an escape character, meaning that it is used to specify 
special characters like newline \n or Unicode characters. To write a string literal with 
backslashes, you need to escape them: 

In [67]: s = ' 12\\34' 

In [68]: print(s) 

12\34 

If you have a string with a lot of backslashes and no special characters, you might find 
this a bit annoying. Fortunately you can preface the leading quote of the string with r, 
which means that the characters should be interpreted as is: 

In [69]: s = r 1 thts\has\no\spectal\characters 1 
In [70]: s 

Out [70]: 'this\\has\\no\\speciat\\characters' 

The r stands for raw. 

Adding two strings together concatenates them and produces a new string: 

In [71]: a = 'this is the first half ' 

In [72]: b = 'and this is the second half' 

In [73]: a + b 

0ut[73]: 'this is the first half and this is the second half' 

String templating or formatting is another important topic. The number of ways to 
do so has expanded with the advent of Python 3, and here I will briefly describe the 
mechanics of one of the main interfaces. String objects have a format method that 
can be used to substitute formatted arguments into the string, producing a new 
string: 

In [74]: template = '{0:.2f} {l:s} are worth US${2:d}' 

In this string, 

• {0:. 2f} means to format the first argument as a floating-point number with two 
decimal places. 

• {1: s} means to format the second argument as a string. 

• {2: d} means to format the third argument as an exact integer. 

To substitute arguments for these format parameters, we pass a sequence of argu¬ 
ments to the format method: 
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In [75]: template.fomat(4. 5560, 'Argentine Pesos', 1) 

0ut[75]: '4.56 Argentine Pesos are worth US$1’ 

String formatting is a deep topic; there are multiple methods and numerous options 
and tweaks available to control how values are formatted in the resulting string. To 
learn more, I recommend consulting the official Python documentation. 

I discuss general string processing as it relates to data analysis in more detail in Chap¬ 
ter 8. 

Bytes and Unicode 

In modern Python (i.e., Python 3.0 and up), Unicode has become the first-class string 
type to enable more consistent handling of ASCII and non-ASCII text. In older ver¬ 
sions of Python, strings were all bytes without any explicit Unicode encoding. You 
could convert to Unicode assuming you knew the character encoding. Let’s look at an 
example: 

In [76]: vat = "espanol" 

In [77]: val 
0ut[77]: 'espanol' 

We can convert this Unicode string to its UTF-8 bytes representation using the 
encode method: 

In [78]: val_utf8 = val,encode( 'utf-8' ) 

In [79]: val_utf8 
0ut[79]: b 'espa\xc3\xblol' 

In [80]: type(val_utf8) 

Out[80]: bytes 

Assuming you know the Unicode encoding of a bytes object, you can go back using 
the decode method: 

In [81]: val_utf8.decode!' utf-8' ) 

0ut[81]: 'espanol' 

While it’s become preferred to use UTF-8 for any encoding, for historical reasons you 
may encounter data in any number of different encodings: 

In [82]: val.encode! 'latinl' ) 

0ut[82]: b 'espa\xflol' 

In [83]: val.encode!' utf-16' ) 

Out [83]: b '\xff\xfee\x00s\x00p\x00a\x00\xfl\x00o\x00l\xO0' 

In [84]: val.encode!' utf-16le' ) 

Out [84]: b 'e\x00s\x00p\x00a\x00\xfI\x00o\x00l\x00' 
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It is most common to encounter bytes objects in the context of working with files, 
where implicitly decoding all data to Unicode strings may not be desired. 

Though you may seldom need to do so, you can define your own byte literals by pre¬ 
fixing a string with b: 

In [85]: bytes_val = b'thls Is bytes' 

In [86]: bytes_val 
0ut[86]: b'this is bytes' 

In [87]: decoded = bytes_va!.decode( 'utf8' ) 

In [88]: decoded # this is str (Unicode) now 
0ut[88]: 'this is bytes' 

Booleans 

The two boolean values in Python are written as True and False. Comparisons and 
other conditional expressions evaluate to either True or False. Boolean values are 
combined with the and and or keywords: 

In [89]: True and True 
0ut[89]: True 

In [90]: False or True 
Out[90]: True 

Type casting 

The str, bool, int, and float types are also functions that can be used to cast values 
to those types: 

In [91]: s = '3.14159' 

In [92]: fval = float(s) 

In [93]: type(fval) 

0ut[93]: float 

In [94]: int(fval) 

0ut[94] : 3 

In [95]: bool(fval) 

0ut[95]: True 

In [96]: bool(0) 

0ut[96]: False 
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None 


None is the Python null value type. If a function does not explicitly return a value, it 
implicitly returns None: 

In [97]: a = None 

In [98]: a is None 
0ut[98]: True 

In [99]: b = 5 

In [100]: b is not None 
Out[100]: True 

None is also a common default value for function arguments: 

def add_and_maybe_multiply(a, b, c=None): 
result = a + b 

if c is not None: 

result = result * c 

return result 

While a technical point, it’s worth bearing in mind that None is not only a reserved 
keyword but also a unique instance of NoneType: 

In [101]: type(None) 

Out[101]: NoneType 

Dates and times 

The built-in Python datetime module provides datetime, date, and time types. The 
datetime type, as you may imagine, combines the information stored in date and 
time and is the most commonly used: 

In [102]: from import datetime, date, time 

In [103]: dt = datetime(2011, 10, 29, 20, 30, 21) 

In [104]: dt.day 
Out[104] : 29 

In [105]: dt.minute 
Out[105] : 30 

Given a datetime instance, you can extract the equivalent date and time objects by 
calling methods on the datetime of the same name: 

In [106]: dt.date() 

Out[106] : datetime. date(2011, 10, 29) 
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In [107]: dt.timeQ 

Out[107]: datetine. time(20, 30, 21) 

The strftime method formats a datetine as a string: 

In [108]: dt.strftime( '%n/%d/%Y %H:%M') 

Out[108] : '10/29/2011 20:30' 

Strings can be converted (parsed) into datetine objects with the strptine function: 

In [109]: datetine. strptlme( '20091031' , '%Y%n%d') 

Out[109]: datetine. datetlne(2009, 10, 31, 0, 0) 

See Table 2-5 for a full list of format specifications. 

When you are aggregating or otherwise grouping time series data, it will occasionally 
be useful to replace time fields of a series of datetines—for example, replacing the 
minute and second fields with zero: 

In [110]: dt.replace(nlnute=0, second=0) 

Out[110]: datetine. datetine(2011, 10, 29, 20, 0) 

Since datetine.datetine is an immutable type, methods like these always produce 
new objects. 

The difference of two datetine objects produces a datetine. tinedelta type: 

In [111]: dt2 = datetime(2011, 11, 15, 22, 30) 

In [112]: delta = dt2 - dt 
In [113]: delta 

0ut[113]: datetine.timedelta(17, 7179) 

In [114]: type(delta) 

0ut[114]: datetine.tinedelta 

The output tinedelta(17, 7179) indicates that the timedelta encodes an offset of 17 
days and 7,179 seconds. 

Adding a tinedelta to a datetine produces a new shifted datetine: 

In [115]: dt 

0ut[115] : datetine. datetine(2011, 10, 29, 20, 30, 21) 

In [116] : dt + delta 

0ut[116]: datetine. datetine(2011, 11, 15, 22, 30) 

Table 2-5. Datetime format specification (ISO C89 compatible) 


Type Description 


%Y Four-digit year 

%y Two-digit year 
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Type Description 


%m Two-digit month [01,12] 

%d Two-digit day [01,31] 

%H Hour (24-hour clock) [00,23] 

%I Hour (12-hour clock) [01,12] 

%M Two-digit minute [00,59] 

%S Second [00,61] (seconds 60,61 account for leap seconds) 

%w Weekday as integer [0 (Sunday), 6] 

%U Week number of the year [00,53]; Sunday is considered the first day of the week, and days before the first Sunday of 
the year are "week 0" 

%W Week number of the year [00,53]; Monday is considered the first day of the week, and days before the first Monday of 
the year are "week 0" 

%z UTC time zone offset as +HHMM or -HHMM; empty if time zone naive 
%F Shortcut for %Y-%m-%d (e.g., 2012-4-18) 

%D Shortcut for %m/%d/%y (e.g., 04/18/12) 


Control Flow 

Python has several built-in keywords for conditional logic, loops, and other standard 
control flow concepts found in other programming languages. 

if, elif, and else 

The if statement is one of the most well-known control flow statement types. It 
checks a condition that, if True, evaluates the code in the block that follows: 

if x < 0: 

printf'It's negative') 

An if statement can be optionally followed by one or more elif blocks and a catch¬ 
all else block if all of the conditions are False: 

if x < 0: 

printf'It's negative') 
elif x == 0 : 

print( 'Equal to zero') 
elif 0 < x < 5: 

print( 'Positive but smaller than 5') 
else: 

print( 'Positive and larger than or equal to 5') 

If any of the conditions is True, no further elif or else blocks will be reached. With 
a compound condition using and or or, conditions are evaluated left to right and will 
short-circuit: 

In [117]: a = 5; b = 7 
In [118]: c = 8; d = 4 
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In [119]: if a < b or c > d: 

.: print(' Made it 1 ) 

Made it 

In this example, the comparison c > d never gets evaluated because the first compar¬ 
ison was True. 

It is also possible to chain comparisons: 

In [120]: 4 > 3 > 2 > 1 
Out[120] : True 


for loops 

for loops are for iterating over a collection (like a list or tuple) or an iterater. The 
standard syntax for a for loop is: 

for value in collection: 

# do something with value 

You can advance a for loop to the next iteration, skipping the remainder of the block, 
using the continue keyword. Consider this code, which sums up integers in a list and 
skips None values: 

sequence = [1, 2, None, 4, None, 5] 
total = 0 

for value in sequence: 
if value is None: 

continue 
total += value 

A for loop can be exited altogether with the break keyword. This code sums ele¬ 
ments of the list until a 5 is reached: 

sequence = [1, 2, 0, 4, 6, 5, 2, 1] 
total_unti!_5 = 0 
for value in sequence: 
if value == 5: 
break 

total_until_5 += value 

The break keyword only terminates the innermost for loop; any outer for loops will 
continue to run: 


In [121] 


for i in range(4): 

for j in range(4): 
if j > i: 

break 

print( (i, j)) 


( 0 , 0 ) 
( 1 , 0 ) 
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( 1 , 1 ) 

( 2 , 0 ) 

( 2 , 1 ) 

( 2 , 2 ) 

(3, 0) 

(3, 1) 

(3, 2) 

(3, 3) 

As we will see in more detail, if the elements in the collection or iterator are sequen¬ 
ces (tuples or lists, say), they can be conveniently unpacked into variables in the for 
loop statement: 

for a, b, c in iterator: 

# do something 

while loops 

A while loop specifies a condition and a block of code that is to be executed until the 
condition evaluates to False or the loop is explicitly ended with break: 

x = 256 
total = 0 
while x > 0: 

if total > 500: 

break 

total += x 
x = x // 2 


pass 

pass is the “no-op” statement in Python. It can be used in blocks where no action is to 
be taken (or as a placeholder for code not yet implemented); it is only required 
because Python uses whitespace to delimit blocks: 

if x < 0: 

print( 'negative! ' ) 
elif x == 0: 

# TODO: put something smart here 

pass 

else: 

print( 'positive!' ) 


range 

The range function returns an iterator that yields a sequence of evenly spaced 
integers: 

In [122]: range(10) 

0ut[122]: range(0, 10) 
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In [123]: list(range(10)) 

Out[123]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] 

Both a start, end, and step (which may be negative) can be given: 

In [124]: tist(range(0, 20, 2)) 

0ut[124]: [0, 2, 4, 6, 8, 10, 12, 14, 16, 18] 

In [125]: list(range(5, 0, -1)) 

0ut[125]: [5, 4, 3, 2, 1] 

As you can see, range produces integers up to but not including the endpoint. A 
common use of range is for iterating through sequences by index: 

seq = [1, 2, 3, 4] 
for 1 in range(len(seq)): 
val = seq[i] 

While you can use functions like list to store all the integers generated by range in 
some other data structure, often the default iterator form will be what you want. This 
snippet sums all numbers from 0 to 99,999 that are multiples of 3 or 5: 

sun = 0 

for 1 in range(100000) : 

# % is the modulo operator 
if i % 3 == 0 or i % 5 == 0: 
sun += i 

While the range generated can be arbitrarily large, the memory use at any given time 
may be very small. 

Ternary expressions 

A ternary expression in Python allows you to combine an if-else block that pro¬ 
duces a value into a single line or expression. The syntax for this in Python is: 

value = true-expr if condition else false-expr 

Here, true-expr and false-expr can be any Python expressions. It has the identical 
effect as the more verbose: 

if condition: 

value = true-expr 
else: 

value = false-expr 
This is a more concrete example: 

In [126]: x = 5 

In [127]: 'Non-negative' if x >= 0 else 'Negative' 

0ut[127]: 'Non-negative' 


2.3 Python Language Basics | 49 



As with if-else blocks, only one of the expressions will be executed. Thus, the “if” 
and “else” sides of the ternary expression could contain costly computations, but only 
the true branch is ever evaluated. 

While it may be tempting to always use ternary expressions to condense your code, 
realize that you may sacrifice readability if the condition as well as the true and false 
expressions are very complex. 
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CHAPTER 3 


Built-in Data Structures, Functions, 

and Files 


This chapter discusses capabilities built into the Python language that will be used 
ubiquitously throughout the book. While add-on libraries like pandas and NumPy 
add advanced computational functionality for larger datasets, they are designed to be 
used together with Python’s built-in data manipulation tools. 

We’ll start with Python’s workhorse data structures: tuples, lists, diets, and sets. Then, 
we’ll discuss creating your own reusable Python functions. Finally, we’ll look at the 
mechanics of Python file objects and interacting with your local hard drive. 

3.1 Data Structures and Sequences 

Python’s data structures are simple but powerful. Mastering their use is a critical part 
of becoming a proficient Python programmer. 

Tuple 

A tuple is a fixed-length, immutable sequence of Python objects. The easiest way to 
create one is with a comma-separated sequence of values: 

In [1]: tup = 4, 5, 6 

In [2]: tup 
0ut[2] : (4, 5, 6) 

When you’re defining tuples in more complicated expressions, it’s often necessary to 
enclose the values in parentheses, as in this example of creating a tuple of tuples: 
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In [3]: nested_tup = (4, 5, 6), (7, 8) 


In [4]: nested_tup 

0ut[4] : ((4, 5, 6), (7, 8)) 

You can convert any sequence or iterator to a tuple by invoking tuple: 

In [5]: tuple([4, 0, 2]) 

0ut[5] : (4, 0, 2) 

In [6]: tup = tuple(' string 1 ) 

In [7]: tup 

0ut[7] : ('s', 't' , V, 'i', 'n', ' g 1 ) 

Elements can be accessed with square brackets [ ] as with most other sequence types. 
As in C, C++, Java, and many other languages, sequences are 0-indexed in Python: 

In [8]: tup[0] 

0ut[8] : 's' 

While the objects stored in a tuple may be mutable themselves, once the tuple is cre¬ 
ated it’s not possible to modify which object is stored in each slot: 

In [9]: tup = tuple([ 'foo' , [1, 2], True]) 

In [10]: tup[2] = False 

TypeError Traceback (most recent call last) 

<lpython-input-10-c7308343b841> in <module>() 

-> 1 tup[2] = False 

TypeError: 'tuple' object does not support item assignment 

If an object inside a tuple is mutable, such as a list, you can modify it in-place: 

In [11]: tup[l] . append(3) 

In [12]: tup 

0ut[12]: ('foo', [1, 2, 3], True) 

You can concatenate tuples using the + operator to produce longer tuples: 

In [13]: (4, None, 'foo') + (6, 0) + ('bar',) 

0ut[13]: (4, None, 'foo', 6, 0, 'bar') 

Multiplying a tuple by an integer, as with lists, has the effect of concatenating together 
that many copies of the tuple: 

In [14]: ('foo', 'bar') * 4 

0ut[14]: ('foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'bar') 

Note that the objects themselves are not copied, only the references to them. 
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Unpacking tuples 

If you try to assign to a tuple-like expression of variables, Python will attempt to 
unpack the value on the righthand side of the equals sign: 

In [15]: tup = (4, 5, 6) 

In [16] : a, b, c = tup 

In [17]: b 
0ut[17] : 5 

Even sequences with nested tuples can be unpacked: 

In [18]: tup = 4, 5, (6, 7) 

In [19]: a, b, (c, d) = tup 

In [20]: d 
Out[20] : 7 

Using this functionality you can easily swap variable names, a task which in many 
languages might look like: 

tmp = a 
a = b 
b = tmp 

But, in Python, the swap can be done like this: 

In [21]: a, b = 1, 2 

In [22]: a 
0ut[22] : 1 

In [23]: b 
0ut[23] : 2 

In [24]: b, a = a, b 

In [25]: a 
0ut[25] : 2 

In [26]: b 
0ut[26] : 1 

A common use of variable unpacking is iterating over sequences of tuples or lists: 

In [27]: seq = [(1, 2, 3), (4, 5, 6), (7, 8, 9)] 

In [28]: for a, b, c in seq: 

print( 'a={0}, b={l}, c={2}' ,format(a, b, c)) 

a=l, b=2, c=3 
a=4, b=5, c=6 
a=7, b=8, c=9 
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Another common use is returning multiple values from a function. I’ll cover this in 
more detail later. 

The Python language recently acquired some more advanced tuple unpacking to help 
with situations where you may want to “pluck” a few elements from the beginning of 
a tuple. This uses the special syntax *rest, which is also used in function signatures 
to capture an arbitrarily long list of positional arguments: 

In [29]: values = 1, 2, 3, 4, 5 

In [30]: a, b, *rest = values 

In [31]: a, b 
0ut[31] : (1, 2) 

In [32]: rest 
0ut[32] : [3, 4, 5] 

This rest bit is sometimes something you want to discard; there is nothing special 
about the rest name. As a matter of convention, many Python programmers will use 
the underscore (_) for unwanted variables: 

In [33]: a, b, *_ = values 

Tuple methods 

Since the size and contents of a tuple cannot be modified, it is very light on instance 
methods. A particularly useful one (also available on lists) is count, which counts the 
number of occurrences of a value: 

In [34]: a = (1, 2, 2, 2, 3, 4, 2) 

In [35]: a.count(2) 

0ut[35] : 4 

List 

In contrast with tuples, lists are variable-length and their contents can be modified 
in-place. You can define them using square brackets [] or using the list type func¬ 
tion: 

In [36]: a_list = [2, 3, 7, None] 

In [37]: tup = ('foo', 'bar', 'baz') 

In [38]: b_list = list(tup) 

In [39]: b_list 

0ut[39]: ['foo', 'bar', 'baz'] 

In [40]: b_list[l] = 'peekaboo 1 
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In [41]: b_list 

0ut[41]: ['foo', 'peekaboo 1 , 'baz' ] 

Lists and tuples are semantically similar (though tuples cannot be modified) and can 
be used interchangeably in many functions. 

The list function is frequently used in data processing as a way to materialize an 
iterator or generator expression: 

In [42]: gen = range(10) 

In [43]: gen 
0ut[43]: range(0, 10) 

In [44]: list(gen) 

0ut[44] : [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] 

Adding and removing elements 

Elements can be appended to the end of the list with the append method: 

In [45]: b_list.append( 'dwarf' ) 

In [46]: b_list 

0ut[46]: ['foo', 'peekaboo', 'baz', 'dwarf'] 

Using insert you can insert an element at a specific location in the list: 

In [47]: b_list.insert(l, 'red') 

In [48]: b_list 

0ut[48]: ['foo', 'red', 'peekaboo', 'baz', 'dwarf'] 

The insertion index must be between 0 and the length of the list, inclusive. 



insert is computationally expensive compared with append, 
because references to subsequent elements have to be shifted inter¬ 
nally to make room for the new element. If you need to insert ele¬ 
ments at both the beginning and end of a sequence, you may wish 
to explore collections .deque, a double-ended queue, for this pur¬ 
pose. 


The inverse operation to insert is pop, which removes and returns an element at a 
particular index: 

In [49]: b_list. pop(2) 

0ut[49]: 'peekaboo' 

In [50]: b_list 

Out[50]: ['foo', 'red', 'baz', 'dwarf'] 


3.1 Data Structures and Sequences | 55 




Elements can be removed by value with remove, which locates the first such value and 
removes it from the last: 

In [51]: b_tist.append( 'foo' ) 

In [52]: b_Iist 

0ut[52]: ['foo', 'red', 'baz' , 'dwarf', 'foo'] 

In [53]: b_Iist.remove( 'foo' ) 

In [54]: b_Iist 

0ut[54]: ['red', 'baz', 'dwarf', 'foo'] 

If performance is not a concern, by using append and remove, you can use a Python 
list as a perfectly suitable “multiset” data structure. 

Check if a list contains a value using the in keyword: 

In [55]: 'dwarf' in b_list 
0ut[55]: True 

The keyword not can be used to negate in: 

In [56]: 'dwarf' not in b_list 
0ut[56]: False 

Checking whether a list contains a value is a lot slower than doing so with diets and 
sets (to be introduced shortly), as Python makes a linear scan across the values of the 
list, whereas it can check the others (based on hash tables) in constant time. 

Concatenating and combining lists 

Similar to tuples, adding two lists together with + concatenates them: 

In [57]: [4, None, 'foo'] + [7, 8, (2, 3)] 

0ut[57]: [4, None, 'foo', 7, 8, (2, 3)] 

If you have a list already defined, you can append multiple elements to it using the 
extend method: 

In [58]: x = [4, None, 'foo'] 

In [59]: x.extend([7, 8, (2, 3)]) 

In [60]: x 

Out[60]: [4, None, 'foo', 7, 8, (2, 3)] 

Note that list concatenation by addition is a comparatively expensive operation since 
a new list must be created and the objects copied over. Using extend to append ele¬ 
ments to an existing list, especially if you are building up a large list, is usually pref¬ 
erable. Thus, 
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everything = [] 
for chunk in list_of_llsts : 
everything.extend(chunk) 

is faster than the concatenative alternative: 

everything = [] 

for chunk in list_of_lists: 

everything = everything + chunk 


Sorting 

You can sort a list in-place (without creating a new object) by calling its sort 
function: 

In [61]: a = [7, 2, 5, 1, 3] 

In [62]: a.sort() 

In [63]: a 

0ut[63] : [1, 2, 3, 5, 7] 

sort has a few options that will occasionally come in handy. One is the ability to pass 
a secondary sort key —that is, a function that produces a value to use to sort the 
objects. For example, we could sort a collection of strings by their lengths: 

In [64]: b = ['saw', 'snail', 'He', 'foxes', ’six'] 

In [65]: b.sort(key=len) 

In [66]: b 

0ut[66]: ['He', 'saw', 'six', 'small', 'foxes'] 

Soon, we’ll look at the sorted function, which can produce a sorted copy of a general 
sequence. 

Binary search and maintaining a sorted list 

The built-in bisect module implements binary search and insertion into a sorted list, 
bisect. bisect finds the location where an element should be inserted to keep it sor¬ 
ted, while bisect. Insort actually inserts the element into that location: 

In [67]: import 

In [68]: c = [1, 2, 2, 2, 3, 4, 7] 

In [69]: bisect. bisect(c, 2) 

0ut[69] : 4 

In [70]: bisect. bisect(c, 5) 

Out[70] : 6 
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In [71]: bisect.insort(c, 6) 


In [72]: c 

0ut[72] : [1, 2 , 2 , 2 , 3 , 4, 6, 7] 



The bisect module functions do not check whether the list is sor¬ 
ted, as doing so would be computationally expensive. Thus, using 
them with an unsorted list will succeed without error but may lead 
to incorrect results. 


Slicing 

You can select sections of most sequence types by using slice notation, which in its 
basic form consists of start: stop passed to the indexing operator []: 

In [73]: seq = [7, 2 , 3, 7, 5, 6, 0, 1] 

In [74]: seq[l:5] 

0ut[74] : [2, 3, 1 , 5] 

Slices can also be assigned to with a sequence: 

In [75]: seq[3:4] = [6, 3] 

In [76]: seq 

0ut[76] : [7, 2 , 3, 6, 3, 5, 6, 0, 1] 

While the element at the start index is included, the stop index is not included, so 
that the number of elements in the result is stop - start. 

Either the start or stop can be omitted, in which case they default to the start of the 
sequence and the end of the sequence, respectively: 

In [77]: seq[:5] 

0ut[77] : [7, 2, 3, 6, 3] 

In [78]: seq[3:] 

0ut[78] : [6, 3, 5, 6, 0, 1] 

Negative indices slice the sequence relative to the end: 

In [79]: seq[ 4: ] 

0ut[79] : [5, 6, 0, 1] 

In [80]: seq[ 6:-2] 

Out[80] : [6, 3, 5, 6] 

Slicing semantics takes a bit of getting used to, especially if you’re coming from R or 
MATLAB. See Figure 3-1 for a helpful illustration of slicing with positive and nega¬ 
tive integers. In the figure, the indices are shown at the “bin edges” to help show 
where the slice selections start and stop using positive or negative indices. 
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A step can also be used after a second colon to, say, take every other element: 

In [81]: seq[ ::2] 

0ut[81] : [7, 3, 3, 6, 1] 

A clever use of this is to pass -1, which has the useful effect of reversing a list or tuple: 
In [82]: seq[ ::-1] 

0ut[82] : [1, 0, 6, 5, 3, 6, 3, 2, 7] 
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Figure 3-1. Illustration of Python slicing conventions 

Built-in Sequence Functions 

Python has a handful of useful sequence functions that you should familiarize your¬ 
self with and use at any opportunity. 

enumerate 

It’s common when iterating over a sequence to want to keep track of the index of the 
current item. A do-it-yourself approach would look like: 

1 = 0 

for value in collection: 

# do something with value 
i += 1 

Since this is so common, Python has a built-in function, enumerate, which returns a 
sequence of (i, value) tuples: 

for 1 , value In enumerate(collection) : 

# do something with value 

When you are indexing data, a helpful pattern that uses enumerate is computing a 
diet mapping the values of a sequence (which are assumed to be unique) to their 
locations in the sequence: 

In [83]: some_list = ['foo', 'bar', ’baz 1 ] 

In [84]: mapping = {} 
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In [85]: for i, v in enumerate(some_list) : 

....: mappingfv] = i 

In [86]: napping 

0ut[86]: {'bar': 1, 'baz': 2, 'foo': 0} 

sorted 

The sorted function returns a new sorted list from the elements of any sequence: 

In [87]: sorted([7, 1, 2, 6, 0, 3, 2]) 

0ut[87] : [0, 1, 2, 2, 3, 6, 7] 

In [88]: sorted( 'horse race') 

0ut[88] : [' ', 'a', 'c', 'e' , 'e', 'h', 'o’, ’r', 'r 1 , ’s'] 

The sorted function accepts the same arguments as the sort method on lists. 


zip “pairs” up the elements of a number of lists, tuples, or other sequences to create a 
list of tuples: 

In [89]: seql = ['foo 1 , 'bar', 'baz'] 

In [90]: seq2 = ['one 1 , 'two', 'three'] 

In [91]: zipped = zip(seql, seq2) 

In [92]: list(zipped) 

0ut[92]: [('foo', 'one'), ('bar', 'two'), ('baz', 'three')] 

zip can take an arbitrary number of sequences, and the number of elements it pro¬ 
duces is determined by the shortest sequence: 

In [93]: seq3 = [False, True] 


In [94]: list(zip(seql, seq2, seq3)) 

0ut[94]: [('foo', 'one'. False), ('bar', 'two'. True)] 

A very common use of zip is simultaneously iterating over multiple sequences, possi¬ 
bly also combined with enumerate: 


In [95] 


for i, (a, b) in enumerate(zip(seql, seq2)): 
print('{0}: {1}, {2}'. format(i, a, b)) 


0: foo, one 
1: bar, two 
2: baz, three 
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Given a “zipped” sequence, zip can be applied in a clever way to “unzip” the 
sequence. Another way to think about this is converting a list of rows into a list of 
columns. The syntax, which looks a bit magical, is: 

In [96]: pitchers = [('Nolan', 'Ryan'), ('Roger', ’Clemens'), 

('Schilling', 'Curt')] 

In [97]: first_names, last_names = zip(*pitchers) 

In [98]: first_names 

0ut[98]: ('Nolan 1 , 'Roger', 'Schilling') 

In [99]: last_names 

0ut[99]: ('Ryan', 'Clemens', 'Curt') 

reversed 

reversed iterates over the elements of a sequence in reverse order: 

In [100]: list(reversed(range(10))) 

Out[100] : [9, 8, 7, 6, 5, 4, 3, 2, 1, 0] 

Keep in mind that reversed is a generator (to be discussed in some more detail later), 
so it does not create the reversed sequence until materialized (e.g., with list or a for 
loop). 

diet 

diet is likely the most important built-in Python data structure. A more common 
name for it is hash map or associative array. It is a flexibly sized collection of key-value 
pairs, where key and value are Python objects. One approach for creating one is to use 
curly braces {} and colons to separate keys and values: 

In [101]: empty_dlct = {} 

In [102]: dl = {'a' : 'some value', ’b’ : [1, 2, 3, 4]} 

In [103]: dl 

Out[103]: {'a': 'some value', 'b' : [1, 2, 3, 4]} 

You can access, insert, or set elements using the same syntax as for accessing elements 
of a list or tuple: 

In [104]: d1 [7] = 'an Integer' 

In [105]: dl 

Out[105]: {'a': 'some value', 'b' : [1, 2, 3, 4], 7: 'an Integer'} 

In [106]: dl[' b' ] 

Out[106] : [1, 2, 3, 4] 
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You can check if a diet contains a key using the same syntax used for checking 
whether a list or tuple contains a value: 

In [107]: 'b' In dl 

Out[107]: True 

You can delete values either using the del keyword or the pop method (which simul¬ 
taneously returns the value and deletes the key): 

In [108]: d1 [ 5 ] = 'some value' 

In [109]: dl 

Out[109]: 

{ 1 a' : ' some value' , 

'b' : [1, 2, 3, 4], 

7: 1 an Integer 1 , 

5: 'some value'} 

In [110]: dl[ 'dummy'] = 'another value' 

In [111]: dl 

Out[lll]: 

{ 'a' : ' some value' , 

'b' : [1, 2, 3, 4], 

7: 'an Integer' , 

5: 'some value' , 

'dummy': 'another value'} 

In [112]: del dl[5] 

In [113]: dl 

0ut[113]: 

{ 'a' : ' some value' , 

'b' : [1, 2, 3, 4], 

7: 'an integer' , 

'dummy': 'another value'} 

In [114]: ret = dl.pop( 'dummy' ) 

In [115]: ret 

0ut[115]: 'another value' 

In [116]: dl 

0ut[116]: {'a': 'some value', 'b' : [1, 2, 3, 4], 7: 'an integer'} 

The keys and values method give you iterators of the diet’s keys and values, respec¬ 
tively. While the key-value pairs are not in any particular order, these functions out¬ 
put the keys and values in the same order: 

In [117]: list(dl.keysO) 

0ut[117]: [ 'a' , 'b’ , 7] 
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In [118]: list(dl.values()) 

0ut[118]: ['some value', [1, 2, 3, 4], 'an integer'] 

You can merge one diet into another using the update method: 

In [119]: dl.update({ 'b' : 'foo', 'c' : 12}) 

In [120]: dl 

Out[120]: {'a': 'some value', 'b' : 'foo', 7: 'an integer', 'c' : 12} 

The update method changes diets in-place, so any existing keys in the data passed to 
update will have their old values discarded. 

Creating diets from sequences 

It’s common to occasionally end up with two sequences that you want to pair up 
element-wise in a diet. As a first cut, you might write code like this: 

mapping = {} 

for key, value in zip(key_list, value_list): 
mapping[key] = value 

Since a diet is essentially a collection of 2-tuples, the diet function accepts a list of 
2-tuples: 

In [121]: mapping = dict(zip(range(5), reversed(range(5)))) 

In [122]: mapping 

0ut[122]: {0: 4, 1: 3, 2: 2, 3: 1, 4: 0} 

Later we’ll talk about diet comprehensions, another elegant way to construct diets. 

Default values 

It’s very common to have logic like: 

if key in some_dict: 

value = some_dict[key] 
else: 

value = default_value 

Thus, the diet methods get and pop can take a default value to be returned, so that 
the above If-else block can be written simply as: 

value = some_dict.get(key, default_value) 

get by default will return None if the key is not present, while pop will raise an excep¬ 
tion. With setting values, a common case is for the values in a diet to be other collec¬ 
tions, like lists. For example, you could imagine categorizing a list of words by their 
first letters as a diet of lists: 

In [123]: words = ['apple', 'bat', 'bar', 'atom', 'book'] 

In [124]: by_letter = {} 
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In [125] 


for word in words: 
letter = word[0] 
if letter not in by_letter: 

by_letter[letter] = [word] 
else: 

by_letter[letter].append(word) 


In [126]: by_letter 

0ut[126]: {'a': ['apple', 'atom'], 'b': ['bat 1 , 'bar', 'book']} 

The setdefault diet method is for precisely this purpose. The preceding for loop 
can be rewritten as: 

for word in words: 
letter = word[0] 

by_letter.setdefault (letter, []). append(word) 

The built-in collections module has a useful class, defaultdict, which makes this 
even easier. To create one, you pass a type or function for generating the default value 
for each slot in the diet: 

from import defaultdict 

by_letter = defaultdict(list) 
for word in words: 

by_letter[word[0]]. append(word) 

Valid diet key types 

While the values of a diet can be any Python object, the keys generally have to be 
immutable objects like scalar types (int, float, string) or tuples (all the objects in the 
tuple need to be immutable, too). The technical term here is hashability. You can 
check whether an object is hashable (can be used as a key in a diet) with the hash 
function: 

In [127]: hash( 'string 1 ) 

0ut[127]: 5023931463650008331 

In [128]: hash((l, 2, (2, 3))) 

0ut[128]: 1097636502276347782 

In [129]: hash((l, 2, [2, 3])) # fails because lists are mutable 


TypeError Traceback (most recent call last) 

<ipython-input-129-800cdl4ba8be> in <module>() 

-> 1 hash((l, 2, [2, 3])) # fails because lists are mutable 

TypeError: unhashable type: 'list' 
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To use a list as a key, one option is to convert it to a tuple, which can be hashed as 
long as its elements also can: 

In [130]: d = {} 

In [131]: d[tuple( [1, 2, 3])] = 5 
In [132]: d 

0ut[132]: {(1, 2, 3): 5} 


A set is an unordered collection of unique elements. You can think of them like diets, 
but keys only, no values. A set can be created in two ways: via the set function or via 
a set literal with curly braces: 

In [133]: set([2, 2, 2, 1, 3, 3]) 

0ut[133]: {1, 2, 3} 

In [134]: {2, 2, 2, 1, 3, 3} 

0ut[134]: {1, 2, 3} 

Sets support mathematical set operations like union, intersection, difference, and 
symmetric difference. Consider these two example sets: 

In [135]: a = {1, 2, 3, 4, 5} 

In [136]: b = {3, 4, 5, 6, 7, 8} 

The union of these two sets is the set of distinct elements occurring in either set. This 
can be computed with either the union method or the | binary operator: 

In [137]: a.union(b) 

0ut[137]: {1, 2, 3, 4, 5, 6, 7, 8} 

In [138]: a | b 

0ut[138]: {1, 2, 3, 4, 5, 6, 7, 8} 

The intersection contains the elements occurring in both sets. The & operator or the 
intersection method can be used: 

In [139]: a.intersection(b) 

0ut[139]: {3, 4, 5} 

In [140]: a & b 
Out[140]: {3, 4, 5} 

See Table 3-1 for a list of commonly used set methods. 
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Table 3-1. Python set operations 


Function 

Alternative 

syntax 

Description 

a.add(x) 

N/A 

Add element x to the set a 

a.clear() 

N/A 

Reset the set a to an empty state, discarding all of 
its elements 

a.remove(x) 

N/A 

Remove element x from the set a 

a.popQ 

N/A 

Remove an arbitrary element from the set a, raising 
KeyError ifthesetis empty 

a.unlon(b) 

a | b 

All of the unique elements in a and b 

a.update(b) 

a |= b 

Set the contents of a to be the union of the 

elements in a and b 

a.intersectlon(b) 

a & b 

All of the elements in both a and b 

a .lntersectlon_update(b) 

a &= b 

Set the contents of a to be the intersection of the 

elements in a and b 

a.difference(b) 

a - b 

The elements in a that are not in b 

a.difference_update(b) 

a -= b 

Set a to the elements in a that are not in b 

a. symmetric_difference(b) 

a A b 

All of the elements in either a or b but not both 

a. symmetric_difference_update(b) 

a A = b 

Set a to contain the elements in either a or b but 
not both 

a.issubset(b) 

N/A 

True if the elements of a are all contained in b 

a.issuperset(b) 

N/A 

True if the elements of bare all contained in a 

a.isdisjoint(b) 

N/A 

True if a and b have no elements in common 

All of the logical set operations have in- 

place counterparts, which enable you to 

replace the contents of the set on the left side of the operation with the result. For 

very large sets, this may be more 

efficient: 


In [141] : c = a.copyO 



In [142]: c |= b 



In [143]: c 

0ut[143] : {1, 2, 3, 4, 5, 6, 

7, 8} 


In [144]: d = a.copyO 



In [145]: d &= b 



In [146]: d 

0ut[146] : {3, 4, 5} 



Like diets, set elements generally must be immutable. To have list-like elements, you 

must convert it to a tuple: 
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In [147]: my_data = [1, 2, 3, 4] 

In [148]: my_set = {tuple(my_data)} 

In [149]: my_set 
0ut[149]: {(1, 2, 3, 4)} 

You can also check if a set is a subset of (is contained in) or a superset of (contains all 
elements of) another set: 

In [150]: a_set = {1, 2, 3, 4, 5} 

In [151]: {1, 2, 3}.tssubset(a_set) 

0ut[151]: True 

In [152]: a_set.issuperset({l, 2, 3}) 

0ut[152]: True 

Sets are equal if and only if their contents are equal: 

In [153]: {1, 2, 3} == {3, 2, 1} 

0ut[153]: True 

List, Set, and Diet Comprehensions 

List comprehensions are one of the most-loved Python language features. They allow 
you to concisely form a new list by filtering the elements of a collection, transforming 
the elements passing the filter in one concise expression. They take the basic form: 

[expr for val tn collection if condition ] 

This is equivalent to the following for loop: 

result = [] 

for val in collection: 
if condition: 

result.append(expr) 

The filter condition can be omitted, leaving only the expression. For example, given a 
list of strings, we could filter out strings with length 2 or less and also convert them to 
uppercase like this: 

In [154]: strings = ['a', 'as', 'bat', 'car', 'dove', 'python'] 

In [155]: [x.upperQ for x in strings if len(x) > 2] 

0ut[155]: ['BAT', 'CAR', 'DOVE', 'PYTHON'] 

Set and diet comprehensions are a natural extension, producing sets and diets in an 
idiomatically similar way instead of lists. A diet comprehension looks like this: 

dict_comp = {key-expr : value-expr for value in collection 
if condition } 
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A set comprehension looks like the equivalent list comprehension except with curly 
braces instead of square brackets: 

set_comp = {expr for value tn collection If condition } 

Like list comprehensions, set and diet comprehensions are mostly conveniences, but 
they similarly can make code both easier to write and read. Consider the list of strings 
from before. Suppose we wanted a set containing just the lengths of the strings con¬ 
tained in the collection; we could easily compute this using a set comprehension: 

In [156]: unique_lengths = (len(x) for x in strings} 

In [157]: unique_lengths 
0ut[157]: {1, 2, 3, 4, 6} 

We could also express this more functionally using the nap function, introduced 
shortly: 

In [158]: set(map(ten, strings)) 

0ut[158]: {1, 2, 3, 4, 6} 

As a simple diet comprehension example, we could create a lookup map of these 
strings to their locations in the list: 

In [159]: loc_mapping = {val : index for index, val in enumerate(strings)} 

In [160]: loc_mapping 

Out[160]: {'a': 0, 'as': 1, 'bat': 2, 'car': 3, 'dove': 4, 'python': 5} 

Nested list comprehensions 

Suppose we have a list of lists containing some English and Spanish names: 

In [161]: all_data = [['John', 'Emily', 'Michael', 'Mary', 'Steven'], 

: ['Maria', 'Juan', 'Javier', 'Natalia', 'Pilar']] 

You might have gotten these names from a couple of files and decided to organize 
them by language. Now, suppose we wanted to get a single list containing all names 
with two or more e s in them. We could certainly do this with a simple for loop: 

names_of_interest = [] 
for names in all_data: 

enough_es = [name for name in names if name.count(' e' ) >= 2] 
names_of_interest.extend(enough_es) 

You can actually wrap this whole operation up in a single nested list comprehension, 
which will look like: 

In [162]: result = [name for names in all_data for name in names 
.: if name.count( 'e' ) >= 2] 


In [163]: result 
0ut[163]: ['Steven'] 
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At first, nested list comprehensions are a bit hard to wrap your head around. The for 
parts of the list comprehension are arranged according to the order of nesting, and 
any filter condition is put at the end as before. Here is another example where we 
“flatten” a list of tuples of integers into a simple list of integers: 

In [164]: some_tuples = [(1, 2, 3), (4, 5, 6), (7, 8, 9)] 

In [165]: flattened = [x for tup in some_tuples for x in tup] 

In [166]: flattened 

0ut[166]: [1, 2, 3, 4, 5, 6, 7, 8, 9] 

Keep in mind that the order of the for expressions would be the same if you wrote a 
nested for loop instead of a list comprehension: 

flattened = [] 

for tup in some_tuples : 
for x in tup: 

flattened. append(x) 

You can have arbitrarily many levels of nesting, though if you have more than two or 
three levels of nesting you should probably start to question whether this makes sense 
from a code readability standpoint. It’s important to distinguish the syntax just shown 
from a list comprehension inside a list comprehension, which is also perfectly valid: 

In [167]: [[x for x in tup] for tup in some_tuples] 

0ut[167]: [[1, 2, 3], [4, 5, 6], [7, 8, 9]] 

This produces a list of lists, rather than a flattened list of all of the inner elements. 

3.2 Functions 

Functions are the primary and most important method of code organization and 
reuse in Python. As a rule of thumb, if you anticipate needing to repeat the same or 
very similar code more than once, it may be worth writing a reusable function. Func¬ 
tions can also help make your code more readable by giving a name to a group of 
Python statements. 

Functions are declared with the def keyword and returned from with the return key¬ 
word: 

def my_function(x, y, z=1.5): 
if z > 1: 

return z * (x + y) 

else: 

return z / (x + y) 
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There is no issue with having multiple return statements. If Python reaches the end 
of a function without encountering a return statement, None is returned automati¬ 
cally. 

Each function can have positional arguments and keyword arguments. Keyword argu¬ 
ments are most commonly used to specify default values or optional arguments. In 
the preceding function, x and y are positional arguments while z is a keyword argu¬ 
ment. This means that the function can be called in any of these ways: 

my_function(5, 6, z=0.7) 
ny_function(3. 14, 7, 3.5) 
my_function(10, 20) 

The main restriction on function arguments is that the keyword arguments must fol¬ 
low the positional arguments (if any). You can specify keyword arguments in any 
order; this frees you from having to remember which order the function arguments 
were specified in and only what their names are. 



It is possible to use keywords for passing positional arguments as 
well. In the preceding example, we could also have written: 

my_function(x=5, y=6, z=7) 
my_functlon(y=6, x=5, z=7) 

In some cases this can help with readability. 


Namespaces, Scope, and Local Functions 

Functions can access variables in two different scopes: global and local. An alternative 
and more descriptive name describing a variable scope in Python is a namespace. Any 
variables that are assigned within a function by default are assigned to the local 
namespace. The local namespace is created when the function is called and immedi¬ 
ately populated by the functions arguments. After the function is finished, the local 
namespace is destroyed (with some exceptions that are outside the purview of this 
chapter). Consider the following function: 

def func(): 
a = [] 

for i in range(5): 
a.append(i) 

When func() is called, the empty list a is created, five elements are appended, and 
then a is destroyed when the function exits. Suppose instead we had declared a as 
follows: 

a = [] 
def func(): 

for i in range(5): 
a.append(i) 
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Assigning variables outside of the functions scope is possible, but those variables 
must be declared as global via the global keyword: 

In [168] : a = None 


In [169] 


def bind_a_vartabte() : 

global a 
a = [] 

btnd_a_vartable() 


In [170]: print(a) 

[] 



I generally discourage use of the global keyword. Typically global 
variables are used to store some kind of state in a system. If you 
find yourself using a lot of them, it may indicate a need for object- 
oriented programming (using classes). 


Returning Multiple Values 

When I first programmed in Python after having programmed in Java and C++, one 
of my favorite features was the ability to return multiple values from a function with 
simple syntax. Here’s an example: 

def f() : 
a = 5 
b = 6 
c = 7 

return a, b, c 
a, b, c = f() 

In data analysis and other scientific applications, you may find yourself doing this 
often. What’s happening here is that the function is actually just returning one object, 
namely a tuple, which is then being unpacked into the result variables. In the preced¬ 
ing example, we could have done this instead: 

return_value = f() 

In this case, return_value would be a 3-tuple with the three returned variables. A 
potentially attractive alternative to returning multiple values like before might be to 
return a diet instead: 

def f() : 
a = 5 
b = 6 
c = 7 

return {'a' : a, 'b' : b, 'c' : c} 
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This alternative technique can be useful depending on what you are trying to do. 

Functions Are Objects 

Since Python functions are objects, many constructs can be easily expressed that are 
difficult to do in other languages. Suppose we were doing some data cleaning and 
needed to apply a bunch of transformations to the following list of strings: 

In [171]: states = [' Alabama 'Georgia!', 'Georgia', 'georgia', 'Florida', 
.: 'south Carolina##', 'West Virginia?'] 

Anyone who has ever worked with user-submitted survey data has seen messy results 
like these. Lots of things need to happen to make this list of strings uniform and 
ready for analysis: stripping whitespace, removing punctuation symbols, and stand¬ 
ardizing on proper capitalization. One way to do this is to use built-in string methods 
along with the re standard library module for regular expressions: 

import re 

def cleanstrings(strings) : 
result = [] 
for value in strings: 

value = value. strip() 
value = re.sub( '[!#?]' , '', value) 
value = value. title() 
result.append (value) 
return result 

The result looks like this: 

In [173]: clean_strings(states) 

0ut[173]: 

[ 'Alabama' , 

1 Georgia 1 , 

1 Georgia 1 , 

1 Georgia 1 , 

1 Florida' , 

’South Carolina’, 

’West Virginia’ ] 

An alternative approach that you may find useful is to make a list of the operations 
you want to apply to a particular set of strings: 

def remove_punctuation(value) : 

return re. sub( ’[!#?]' , '', value) 

clean_ops = [str. strip, remove_punctuation, str. title] 

def clean_strings(strings, ops): 
result = [] 
for value in strings: 
for function in ops: 
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value = function(value) 
result.append(value) 
return result 

Then we have the following: 

In [175]: clean_strlngs(states, clean_ops) 

0ut[175]: 

[ 'Alabama' , 

1 Georgia 1 , 

’Georgia 1 , 

1 Georgia 1 , 

1 Florida' , 

’South Carolina’, 

’West Virginia’ ] 

A more functional pattern like this enables you to easily modify how the strings are 
transformed at a very high level. The clean_strings function is also now more reus¬ 
able and generic. 

You can use functions as arguments to other functions like the built-in nap function, 
which applies a function to a sequence of some kind: 

In [176]: for x in map(remove_punctuatlon, states): 

.: print(x) 

Alabama 

Georgia 

Georgia 

georgla 

Florida 

south Carolina 
West Virginia 

Anonymous (Lambda) Functions 

Python has support for so-called anonymous or lambda functions, which are a way of 
writing functions consisting of a single statement, the result of which is the return 
value. They are defined with the lambda keyword, which has no meaning other than 
“we are declaring an anonymous function”: 

def short_functlon(x) : 

return x * 2 

equiv_anon = lambda x: x * 2 

I usually refer to these as lambda functions in the rest of the book. They are especially 
convenient in data analysis because, as you’ll see, there are many cases where data 
transformation functions will take functions as arguments. It’s often less typing (and 
clearer) to pass a lambda function as opposed to writing a full-out function declara¬ 
tion or even assigning the lambda function to a local variable. For example, consider 
this silly example: 
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def apply_to_li.st(some_li.st, f): 

return [f(x) for x in some_llst] 

ints = [4, 0, 1, 5, 6] 
apply_to_list(ints, lambda x: x * 2) 

You could also have written [x * 2 for x in ints], but here we were able to suc¬ 
cinctly pass a custom operator to the apply_to_list function. 

As another example, suppose you wanted to sort a collection of strings by the number 
of distinct letters in each string: 

In [177]: strings = ['foo', 'card', ’bar’, 'aaaa', 'abab' ] 

Here we could pass a lambda function to the list’s sort method: 

In [178]: strings. sort(key=lambda x: len(set(list(x)))) 

In [179]: strings 

0ut[179]: ['aaaa', 'foo', 'abab', 'bar', 'card'] 



One reason lambda functions are called anonymous functions is 
that, unlike functions declared with the def keyword, the function 
object itself is never given an explicit_ name _attribute. 


Currying: Partial Argument Application 

Currying is computer science jargon (named after the mathematician Haskell Curry) 
that means deriving new functions from existing ones by partial argument applica¬ 
tion. For example, suppose we had a trivial function that adds two numbers together: 

def add_numbers(x, y): 

return x + y 

Using this function, we could derive a new function of one variable, add_five, that 
adds 5 to its argument: 

add_five = lambda y: add_numbers(5, y) 

The second argument to add_numbers is said to be curried. There’s nothing very fancy 
here, as all we’ve really done is define a new function that calls an existing function. 
The built-in functools module can simplify this process using the partial function: 

from Import partial 

add_five = partial(add_numbers, 5) 
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Generators 

Having a consistent way to iterate over sequences, like objects in a list or lines in a 
file, is an important Python feature. This is accomplished by means of the iterator 
protocol, a generic way to make objects iterable. For example, iterating over a diet 
yields the diet keys: 

In [180]: some_dict = {'a': 1, 'b' : 2, ' c' : 3} 

In [181]: for key in sorne_dtct: 

.: print(key) 

a 

b 

c 

When you write for key in some_dict, the Python interpreter first attempts to cre¬ 
ate an iterator out of some_dict: 

In [182]: dict_Lterator = iter(some_dict) 

In [183]: dlct_iterator 

0ut[183]: <dict_keyiterator at 0x7fbbd5a9f908> 

An iterator is any object that will yield objects to the Python interpreter when used in 
a context like a for loop. Most methods expecting a list or list-like object will also 
accept any iterable object. This includes built-in methods such as min, max, and sum, 
and type constructors like list and tuple: 

In [184]: tist(dict_iterator) 

0ut[184]: ['a', 1 b 1 , 'c' ] 

A generator is a concise way to construct a new iterable object. Whereas normal func¬ 
tions execute and return a single result at a time, generators return a sequence of 
multiple results lazily, pausing after each one until the next one is requested. To create 
a generator, use the yield keyword instead of return in a function: 

def squares(n=10) : 

print( 'Generating squares from 1 to {0}'. format(n ** 2)) 
for i in range(l, n + 1): 
yield i ** 2 

When you actually call the generator, no code is immediately executed: 

In [186]: gen = squaresQ 
In [187]: gen 

0ut[187]: <generator object squares at 0x7fbbd5ab4570> 

It is not until you request elements from the generator that it begins executing its 
code: 
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In [188] : for x in gen: 

.: print(x, end=' ') 

Generating squares from 1 to 100 
1 4 9 16 25 36 49 64 81 100 

Generator expresssions 

Another even more concise way to make a generator is by using a generator expres¬ 
sion. This is a generator analogue to list, diet, and set comprehensions; to create one, 
enclose what would otherwise be a list comprehension within parentheses instead of 
brackets: 

In [189]: gen = (x ** 2 for x in range(100)) 

In [190]: gen 

Out[190]: <generator object <genexpr> at 0x7fbbd5ab29e8> 

This is completely equivalent to the following more verbose generator: 

def _make_gen( ): 

for x in range(100): 
yield x ** 2 
gen = _make_gen() 

Generator expressions can be used instead of list comprehensions as function argu¬ 
ments in many cases: 

In [191]: sum(x ** 2 for x in range(100)) 

0ut[191]: 328350 

In [192]: dict((i, i **2) for i in range(5)) 

0ut[192]: {0: 0, 1: 1, 2: 4, 3: 9, 4: 16} 

itertools module 

The standard library itertools module has a collection of generators for many com¬ 
mon data algorithms. For example, groupby takes any sequence and a function, 
grouping consecutive elements in the sequence by return value of the function. Here’s 
an example: 

In [193]: import 

In [194]: first_letter = lambda x: x[0] 

In [195]: names = ['Alan', 'Adam', 'Wes', 'Will', 'Albert', 'Steven'] 

In [196]: for letter, names in itertools.groupby(names, first_letter) : 

.: print(letter, list(names)) # names is a generator 

A [ 'Alan' , 'Adam' ] 

W [ 'Wes' , 'Will' ] 

A [ 'Albert' ] 

S ['Steven'] 
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See Table 3-2 for a list of a few other itertools functions I’ve frequently found help¬ 
ful. You may like to check out the official Python documentation for more on this 
useful built-in utility module. 

Table 3-2. Some useful itertools functions 


Function Description 


combtnatlons(lterable, k) Generates a sequence of all possible k-tuples of elements in the iterable, 

ignoring order and without replacement (see also the companion function 
combtnatlons_wlth_replacement) 

permutatlons(lterable, k) Generates a sequence of all possible k-tuples of elements in the iterable, 

respecting order 

groupby(iterabte[, keyfunc]) Generates (key, sub-iterator) for each unique key 
product (Hterables, repeat=l) Generates the Cartesian product of the input iterables as tuples, similar to a 

nested for loop 


Errors and Exception Handling 

Handling Python errors or exceptions gracefully is an important part of building 
robust programs. In data analysis applications, many functions only work on certain 
kinds of input. As an example, Pythons float function is capable of casting a string 
to a floating-point number, but fails with ValueError on improper inputs: 

In [197]: float( '1.2345' ) 

0ut[197]: 1.2345 

In [198]: ftoat( 'something' ) 


ValueError Traceback (most recent call last) 

<lpython-lnput-198-439904410854> in <module>() 

-> 1 float( 1 something' ) 

ValueError: could not convert string to float: 'something' 

Suppose we wanted a version of float that fails gracefully, returning the input argu¬ 
ment. We can do this by writing a function that encloses the call to float in a try/ 
except block: 

def attempt_float(x) : 

try: 

return float(x) 
except: 

return x 

The code in the except part of the block will only be executed if float(x) raises an 
exception: 

In [200]: attempt_float( '1.2345' ) 

Out[200]: 1.2345 
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In [201]: attenpt_float( 'something' ) 

Out[201]: 'something' 

You might notice that float can raise exceptions other than ValueError: 

In [202]: float((l, 2)) 


TypeError Traceback (most recent call last) 

<ipython-input-202-842079ebb635> in <module>() 

-> 1 float((l, 2)) 

TypeError: float() argument must be a string or a number, not 'tuple' 

You might want to only suppress ValueError, since a TypeError (the input was not a 
string or numeric value) might indicate a legitimate bug in your program. To do that, 
write the exception type after except: 

def attempt_float(x) : 

try: 

return float(x) 
except ValueError: 
return x 

We have then: 

In [204]: attempt_float((l, 2)) 


TypeError Traceback (most recent call last) 

<ipython-input-204-9bdfd730cead> in <module>() 

-> 1 attempt_float((l, 2)) 

<ipython-input-203-3e06b8379b6b> in attempt_float(x) 

1 def attempt_float(x) : 

2 try: 

-> 3 return float(x) 

4 except ValueError: 

5 return x 

TypeError: float() argument must be a string or a number, not 'tuple' 

You can catch multiple exception types by writing a tuple of exception types instead 
(the parentheses are required): 

def attempt_float(x) : 

try: 

return float(x) 

except (TypeError, ValueError): 
return x 

In some cases, you may not want to suppress an exception, but you want some code 
to be executed regardless of whether the code in the try block succeeds or not. To do 
this, use finally: 

f = open(path, ’w' ) 

try: 

write_to_file(f ) 


78 | Chapter 3: Built-in Data Structures, Functions, and Files 






finally: 

f,close() 

Here, the file handle f will always get closed. Similarly, you can have code that exe¬ 
cutes only if the try: block succeeds using else: 

f = open(path, ' w' ) 

try: 

write_to_file(f ) 
except: 

print( 'Failed' ) 
else: 

print( 'Succeeded' ) 
finally: 

f,close() 

Exceptions in IPython 

If an exception is raised while you are %run-ing a script or executing any statement, 
IPython will by default print a full call stack trace (traceback) with a few lines of con¬ 
text around the position at each point in the stack: 

In [10]: %run examples/ipython_bug.py 


AssertionError Traceback (most recent call last) 

/home/wesm/code/pydata-book/examples/ipython_bug.py in <module>() 

13 throws_an_exception( ) 

14 

—> 15 calling_things() 

/home/wesm/code/pydata-book/examples/ipython_bug.py in calling_things( ) 

11 def calling_things() : 

12 works_fine() 

—> 13 throws_an_exception( ) 

14 

15 calling_things() 

/home/wesm/code/pydata-book/examples/ipython_bug.py in throws_an_exception() 

7 a = 5 

8 b = 6 

-> 9 assert(a + b == 10) 

10 

11 def calling_things() : 

AssertionError: 

Having additional context by itself is a big advantage over the standard Python inter¬ 
preter (which does not provide any additional context). You can control the amount 
of context shown using the %xmode magic command, from Plain (same as the stan¬ 
dard Python interpreter) to Verbose (which inlines function argument values and 
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more). As you will see later in the chapter, you can step into the stack (using the 
%debug or %pdb magics) after an error has occurred for interactive post-mortem 
debugging. 

3.3 Files and the Operating System 

Most of this book uses high-level tools like pandas. read_csv to read data files from 
disk into Python data structures. However, it’s important to understand the basics of 
how to work with files in Python. Fortunately, it’s very simple, which is one reason 
why Python is so popular for text and file munging. 

To open a file for reading or writing, use the built-in open function with either a rela¬ 
tive or absolute file path: 

In [207]: path = 'examples/segismundo.txt' 

In [208]: f = open(path) 

By default, the file is opened in read-only mode ' r'. We can then treat the file handle 
f like a list and iterate over the lines like so: 

for line in f: 
pass 

The lines come out of the file with the end-of-line (EOL) markers intact, so you’ll 
often see code to get an EOL-free list of lines in a file like: 

In [209]: lines = [x.rstripQ for x in open(path)] 

In [210]: lines 
Out[210] : 

['Suena el rico en su riqueza,', 

'que mas cuidados le ofrece;' , 

J 

'suena el pobre que padece', 

'su miseria y su pobreza;', 

) 

'suena el que a nedrar empieza,', 

'suena el que afana y pretende,', 

'suena el que agravia y ofende,', 

) 

'y en el mundo, en conclusion,', 

'todos suenan lo que son,', 

'aunque ninguno lo entiende.', 

"] 

When you use open to create file objects, it is important to explicitly close the file 
when you are finished with it. Closing the file releases its resources back to the oper¬ 
ating system: 

In [211]: f.close() 


80 | Chapter 3: Built-in Data Structures, Functions, and Files 




One of the ways to make it easier to clean up open files is to use the with statement: 

In [212]: with open(path) as f: 

.: lines = [x.rstripQ for x in f] 

This will automatically close the file f when exiting the with block. 

If we had typed f = open(path, ' w'), a new file at examplesfsegismundo.txt would 
have been created (be careful!), overwriting any one in its place. There is also the ' x' 
file mode, which creates a writable file but fails if the file path already exists. See 
Table 3-3 for a list of all valid file read/write modes. 

For readable files, some of the most commonly used methods are read, seek, and 
tell, read returns a certain number of characters from the file. What constitutes a 
“character” is determined by the file’s encoding (e.g., UTF-8) or simply raw bytes if 
the file is opened in binary mode: 

In [213]: f = open(path) 

In [214]: f.read(10) 

0ut[214] : 'Suena el r' 

In [215]: f2 = open(path, 'rb') # Binary node 

In [216]: f2.read(10) 

0ut[216]: b' Sue\xc3\xbla el 1 

The read method advances the file handles position by the number of bytes read, 
tell gives you the current position: 

In [217]: f.tell() 

0ut[217] : 11 

In [218]: f2.tell() 

0ut[218] : 10 

Even though we read 10 characters from the file, the position is 11 because it took 
that many bytes to decode 10 characters using the default encoding. You can check 
the default encoding in the sys module: 

In [219]: import 

In [220]: sys.getdefaultencodingQ 
Out[220] : 'utf-8' 

seek changes the file position to the indicated byte in the file: 

In [221]: f.seek(3) 

0ut[221] : 3 

In [222]: f.read(l) 

0ut[222] : 'n' 
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Lastly, we remember to close the files: 
In [223]: f.closeQ 
In [224]: f2.close() 

Table 3-3. Python file modes 


Mode Description 


r Read-only mode 

w Write-only mode; creates a new file (erasing the data for any file with the same name) 

x Write-only mode; creates a new file, but fails if the file path already exists 

a Append to existing file (create the file if it does not already exist) 
r+ Read and write 

b Add to mode for binary files (i.e., ' rb' or 1 wb 1 ) 

t Text mode for files (automatically decoding bytes to Unicode). This is the default if not specified. Add t to other 
modes to use this (i.e., 1 rt' or' xt') 


To write text to a file, you can use the file’s write or writelines methods. For exam¬ 
ple, we could create a version of prof_mod.py with no blank lines like so: 

In [225]: with open(' tmp.txt' , 'w') as handle: 

.: handle.writelines(x for x in open(path) if len(x) > 1) 

In [226]: with open(' tnp.txt' ) as f: 

.: lines = f.readlines() 

In [227]: lines 

0ut[227] : 

['Suena el rico en su riqueza,\n', 

'que mas cuidados le ofrece;\n', 

'suena el pobre que padece\n', 

'su miseria y su pobreza;\n', 

'suena el que a nedrar enpieza,\n', 

'suena el que afana y pretende,\n' , 

'suena el que agravia y ofende,\n', 

'y en el nundo, en conclusion, \n' , 

'todos suenan lo que son,\n', 

'aunque ninguno lo entiende.\n' ] 

See Table 3-4 for many of the most commonly used file methods. 

Table 3-4. Important Python file methods or attributes 


Method Description 


read([size]) Return data from file as a string, with optional size argument indicating the number of 

bytes to read 

readlines ([size]) Return list of lines in the file, with optional size argument 
write(str) Write passed string to file 
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1 Method 

Description 


writelines(strings) 

Write passed sequence of strings to the file 


close() 

Close the handle 


flush() 

Flush the internal I/O buffer to disk 


seek(pos) 

Move to indicated hie position (integer) 


tell() 

Return current hie position as integer 


closed 

T rue if the hie is closed 



Bytes and Unicode with Files 

The default behavior for Python files (whether readable or writable) is text mode, 
which means that you intend to work with Python strings (i.e., Unicode). This con¬ 
trasts with binary mode, which you can obtain by appending b onto the file mode. 
Let’s look at the file (which contains non-ASCII characters with UTF-8 encoding) 
from the previous section: 

In [230]: with open(path) as f: 

.: chars = f.read(10) 

In [231]: chars 
0ut[231]: 'Suena el r' 

UTF-8 is a variable-length Unicode encoding, so when I requested some number of 
characters from the file, Python reads enough bytes (which could be as few as 10 or as 
many as 40 bytes) from the file to decode that many characters. If I open the file in 
1 rb 1 mode instead, read requests exact numbers of bytes: 

In [232]: with open(path, 'rb') as f: 

.: data = f.read(10) 


In [233]: data 

0ut[233]: b' Sue\xc3\xbla el 1 

Depending on the text encoding, you may be able to decode the bytes to a str object 
yourself, but only if each of the encoded Unicode characters is fully formed: 

In [234]: data.decode( 'utf8' ) 

0ut[234] : 'Suena el ' 

In [235]: data[ :4] .decode( 'utf8' ) 


UnicodeDecodeError Traceback (most recent call last) 

<ipython-input-235-300e0afl0bb7> in <module>() 

-> 1 data[ :4] .decode( 'utf8' ) 

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 3: unexpecte 
d end of data 

Text mode, combined with the encoding option of open, provides a convenient way 
to convert from one Unicode encoding to another: 
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In [236]: slnk_path = 'sink.txt' 

In [237]: with open(path) as source: 

.: with open(sink_path, 'xt', encoding='iso-8859-l' ) as sink: 

.: sink.write(source.read()) 

In [238]: with open(sink_path, encoding= 1 iso-8859-1' ) as f: 

.: print(f . read(10)) 

Suena el r 

Beware using seek when opening files in any mode other than binary. If the file posi¬ 
tion falls in the middle of the bytes defining a Unicode character, then subsequent 
reads will result in an error: 

In [240]: f = open(path) 

In [241]: f.read(5) 

0ut[241]: 'Suena' 

In [242]: f.seek(4) 

0ut[242] : 4 

In [243]: f.read(l) 


UnicodeDecodeError Traceback (most recent call last) 

<ipython-input-243-7841103e33f5> in <module>() 

-> 1 f.read(l) 

/niniconda/envs/book-env/lib/python3.6/codecs.py in decode(self, input, final) 

319 # decode input (taking the buffer into account) 

320 data = self. buffer + input 

--> 321 (result, consumed) = self ._buffer_decode(data, self. errors, final 

) 

322 # keep undecoded input until the next call 

323 self. buffer = data[consumed :] 

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbl in position 0: invalid s 
tart byte 

In [244]: f.closeQ 

If you find yourself regularly doing data analysis on non-ASCII text data, mastering 
Pythons Unicode functionality will prove valuable. See Python’s online documenta¬ 
tion for much more. 

3.4 Conclusion 

With some of the basics and the Python environment and language now under our 
belt, it’s time to move on and learn about NumPy and array-oriented computing in 
Python. 


84 | Chapter 3: Built-in Data Structures, Functions, and Files 








CHAPTER 4 


NumPy Basics: Arrays and 
Vectorized Computation 


NumPy, short for Numerical Python, is one of the most important foundational pack¬ 
ages for numerical computing in Python. Most computational packages providing 
scientific functionality use NumPy’s array objects as the lingua franca for data 
exchange. 

Here are some of the things you’ll find in NumPy: 

• ndarray, an efficient multidimensional array providing fast array-oriented arith¬ 
metic operations and flexible broadcasting capabilities. 

• Mathematical functions for fast operations on entire arrays of data without hav¬ 
ing to write loops. 

• Tools for reading/writing array data to disk and working with memory-mapped 
files. 

• Linear algebra, random number generation, and Fourier transform capabilities. 

• AC API for connecting NumPy with libraries written in C, C++, or FORTRAN. 

Because NumPy provides an easy-to-use C API, it is straightforward to pass data to 
external libraries written in a low-level language and also for external libraries to 
return data to Python as NumPy arrays. This feature has made Python a language of 
choice for wrapping legacy C/C++/Fortran codebases and giving them a dynamic and 
easy-to-use interface. 

While NumPy by itself does not provide modeling or scientific functionality, having 
an understanding of NumPy arrays and array-oriented computing will help you use 
tools with array-oriented semantics, like pandas, much more effectively. Since 
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NumPy is a large topic, I will cover many advanced NumPy features like broadcasting 
in more depth later (see Appendix A). 

For most data analysis applications, the main areas of functionality I’ll focus on are: 

• Fast vectorized array operations for data munging and cleaning, subsetting and 
filtering, transformation, and any other kinds of computations 

• Common array algorithms like sorting, unique, and set operations 

• Efficient descriptive statistics and aggregating/summarizing data 

• Data alignment and relational data manipulations for merging and joining 
together heterogeneous datasets 

• Expressing conditional logic as array expressions instead of loops with if-elif- 
else branches 

• Group-wise data manipulations (aggregation, transformation, function applica¬ 
tion) 

While NumPy provides a computational foundation for general numerical data pro¬ 
cessing, many readers will want to use pandas as the basis for most kinds of statistics 
or analytics, especially on tabular data, pandas also provides some more domain- 
specific functionality like time series manipulation, which is not present in NumPy. 



Array-oriented computing in Python traces its roots back to 1995, 
when Jim Hugunin created the Numeric library. Over the next 10 
years, many scientific programming communities began doing 
array programming in Python, but the library ecosystem had 
become fragmented in the early 2000s. In 2005, Travis Oliphant 
was able to forge the NumPy project from the then Numeric and 
Numarray projects to bring the community together around a sin¬ 
gle array computing framework. 


One of the reasons NumPy is so important for numerical computations in Python is 
because it is designed for efficiency on large arrays of data. There are a number of 
reasons for this: 

• NumPy internally stores data in a contiguous block of memory, independent of 
other built-in Python objects. NumPy s library of algorithms written in the C lan¬ 
guage can operate on this memory without any type checking or other overhead. 
NumPy arrays also use much less memory than built-in Python sequences. 

• NumPy operations perform complex computations on entire arrays without the 
need for Python for loops. 
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To give you an idea of the performance difference, consider a NumPy array of one 
million integers, and the equivalent Python list: 

In [7]: import as 

In [8]: my_arr = np.arange(1000000) 

In [9]: my_list = list(range(1000000)) 

Now let’s multiply each sequence by 2: 

In [10]: %time for _ in range(10): my_arr2 = my_arr * 2 
CPU times: user 20 ms, sys: 50 ms, total: 70 ms 
Wall time: 72.4 ms 

In [11]: %time for _ in range(10): my_list2 = [x * 2 for x in my_list] 

CPU times: user 760 ms, sys: 290 ms, total: 1.05 s 
Wall time: 1.05 s 

NumPy-based algorithms are generally 10 to 100 times faster (or more) than their 
pure Python counterparts and use significantly less memory. 

4.1 The NumPy ndarray: A Multidimensional Array Object 

One of the key features of NumPy is its N-dimensional array object, or ndarray, 
which is a fast, flexible container for large datasets in Python. Arrays enable you to 
perform mathematical operations on whole blocks of data using similar syntax to the 
equivalent operations between scalar elements. 

To give you a flavor of how NumPy enables batch computations with similar syntax 
to scalar values on built-in Python objects, I first import NumPy and generate a small 
array of random data: 

In [12]: import as 

# Generate some random data 

In [13]: data = np.random. randn(2, 3) 

In [14]: data 
0ut[14] : 

array( [[ -0.2047, 0.4789, -0.5194], 

[-0.5557, 1.9658, 1.3934]]) 

I then write mathematical operations with data: 

In [15]: data * 10 
0ut[15] : 

array( [[ -2.0471, 4.7894, -5.1944], 

[ -5.5573, 19.6578, 13.9341]]) 

In [16]: data + data 
0ut[16] : 
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array([[-0.4094, 0.9579, -1.0389], 

[-1.1115, 3.9316, 2.7868]]) 

In the first example, all of the elements have been multiplied by 10. In the second, the 
corresponding values in each “cell” in the array have been added to each other. 



In this chapter and throughout the book, I use the standard 
NumPy convention of always using import numpy as np. You are, 
of course, welcome to put from numpy import * in your code to 
avoid having to write np., but I advise against making a habit of 
this. The numpy namespace is large and contains a number of func¬ 
tions whose names conflict with built-in Python functions (like min 
and max). 


An ndarray is a generic multidimensional container for homogeneous data; that is, all 
of the elements must be the same type. Every array has a shape, a tuple indicating the 
size of each dimension, and a dtype, an object describing the data type of the array: 

In [17]: data.shape 
0ut[17] : (2, 3) 

In [18]: data.dtype 
0ut[18]: dtype( 'ftoat64' ) 

This chapter will introduce you to the basics of using NumPy arrays, and should be 
sufficient for following along with the rest of the book While it’s not necessary to 
have a deep understanding of NumPy for many data analytical applications, becom¬ 
ing proficient in array-oriented programming and thinking is a key step along the 
way to becoming a scientific Python guru. 



Whenever you see “array,” “NumPy array,” or “ndarray” in the text, 
with few exceptions they all refer to the same thing: the ndarray 
object. 


Creating ndarrays 

The easiest way to create an array is to use the array function. This accepts any 
sequence-like object (including other arrays) and produces a new NumPy array con¬ 
taining the passed data. For example, a list is a good candidate for conversion: 

In [19]: datal = [6, 7.5, 8, 0, 1] 

In [20]: arrl = np.array(datal) 

In [21]: arrl 

0ut[21] : array( [ 6. , 7.5, 8. , 0. , 1. ]) 
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Nested sequences, like a list of equal-length lists, will be converted into a multidimen¬ 
sional array: 

In [22]: data2 = [[1, 2, 3, 4], [5, 6, 7, 8]] 

In [23]: arr2 = np.array(data2) 

In [24]: arr2 
0ut[24] : 

array([[l, 2, 3, 4], 

[5, 6, 7, 8]]) 

Since data2 was a list of lists, the NumPy array arr2 has two dimensions with shape 
inferred from the data. We can confirm this by inspecting the ndim and shape 
attributes: 

In [25]: arr2.ndim 
0ut[25] : 2 

In [26]: arr2.shape 
0ut[26] : (2, 4) 

Unless explicitly specified (more on this later), np.array tries to infer a good data 
type for the array that it creates. The data type is stored in a special dtype metadata 
object; for example, in the previous two examples we have: 

In [27]: arrl.dtype 
0ut[27]: dtype( 'ftoat64' ) 

In [28]: arr2.dtype 
0ut[28]: dtype( 1 int64’ ) 

In addition to np.array, there are a number of other functions for creating new 
arrays. As examples, zeros and ones create arrays of Os or Is, respectively, with a 
given length or shape, empty creates an array without initializing its values to any par¬ 
ticular value. To create a higher dimensional array with these methods, pass a tuple 
for the shape: 

In [29]: np.zeros(lO) 

0ut[29]: array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]) 


In [30]: np.zeros((3, 6)) 
Out[30] : 


array( [[ 

0., 

0., 

0., 

0., 

0., 

0.], 

[ 

0., 

0., 

0., 

0., 

0., 

0.], 

[ 

0., 

0., 

0., 

0., 

0., 

0.]]) 


In [31]: np.empty((2, 3, 2)) 
0ut[31] : 

array( [[[ 0., 0.], 

[ 0 -, 0 -], 

[ 0 -, 0 -]], 
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[[ 0 ., 0 .], 

[ 0 ., 0 .], 

[ 0 ., 0 .]]]) 



It’s not safe to assume that np.empty will return an array of all 
zeros. In some cases, it may return uninitialized “garbage” values. 


arange is an array-valued version of the built-in Python range function: 

In [32]: np.arange(15) 

0ut[32] : array( [ 0, 1, 2, 3, 4, 5, 6, 1, 8, 9, 10, 11, 12, 13, 14]) 

See Table 4-1 for a short list of standard array creation functions. Since NumPy is 
focused on numerical computing, the data type, if not specified, will in many cases be 
float64 (floating point). 


Table 4-1. Array creation functions 


Function Description 


array Convert input data (list, tuple, array, or other sequence type) to an ndarray either by inferring a dtype 

or explicitly specifying a dtype; copies the input data by default 
asarray Convert input to ndarray, but do not copy if the input is already an ndarray 

arange Like the built-in range but returns an ndarray instead of a list 


ones. Produce an array of all Is with the given shape and dtype; ones_ltke takes another array and 

ones_like produces a ones array of the same shape and dtype 


zeros, Like ones and ones_Like but producing arrays of Os instead 

zeros like 


empty, 
empty_ltke 
full, 
full_like 
eye, Identity 


Create new arrays by allocating new memory, but do not populate with any values like ones and 
zeros 

Produce an array of the given shape and dtype with all values set to the indicated “fill value" 
full_llke takes another array and produces a filled array of the same shape and dtype 
Create a square N x N identity matrix (Is on the diagonal and Os elsewhere) 


Data Types for ndarrays 

The data type or dtype is a special object containing the information (or metadata, 

data about data) the ndarray needs to interpret a chunk of memory as a particular 

type of data: 

In [33]: arrl = np.array([l, 2, 3], dtype=np.float64) 

In [34]: arr2 = np.array([l, 2, 3], dtype=np.int32) 

In [35]: arrl.dtype 
Out[35]: dtype( 'float64' ) 
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In [36]: arr2.dtype 
Out[36]: dtype( 'int32 1 ) 

dtypes are a source of NumPy’s flexibility for interacting with data coming from other 
systems. In most cases they provide a mapping directly onto an underlying disk or 
memory representation, which makes it easy to read and write binary streams of data 
to disk and also to connect to code written in a low-level language like C or Fortran. 
The numerical dtypes are named the same way: a type name, like float or Int, fol¬ 
lowed by a number indicating the number of bits per element. A standard double¬ 
precision floating-point value (what’s used under the hood in Pythons float object) 
takes up 8 bytes or 64 bits. Thus, this type is known in NumPy as float64. See 
Table 4-2 for a full listing of NumPy’s supported data types. 



Don’t worry about memorizing the NumPy dtypes, especially if 
you’re a new user. It’s often only necessary to care about the general 
kind of data you’re dealing with, whether floating point, complex, 
integer, boolean, string, or general Python object. When you need 
more control over how data are stored in memory and on disk, 
especially large datasets, it is good to know that you have control 
over the storage type. 


Table 4-2. NumPy data types 


Type 

Type code 

Description 

int8, uint8 

11, ul 

Signed and unsigned 8-bit (1 byte) integer types 

intl6, utntl6 

12, u2 

Signed and unsigned 16-bit integer types 

int32, utnt32 

14, u4 

Signed and unsigned 32-bit integer types 

lnt64, utnt64 

18, u8 

Signed and unsigned 64-bit integer types 

floatl6 

f2 

Half-precision floating point 

float32 

f4 or f 

Standard single-precision floating point; compatible with C float 

float64 

f8 or d 

Standard double-precision floating point; compatible with C double and 
Python float object 

floatl28 

fl6 or g 

Extended-precision floating point 

complex64, 

c8, cl6. 

Complex numbers represented by two 32,64, or 128 floats, respectively 

compIexl28, 

comp!ex256 

c32 


bool 

? 

Boolean type storing True and False values 

object 

0 

Python object type; a value can be any Python object 

string_ 

S 

Fixed-length ASCII string type (1 byte per character); for example, to create a 
string dtype with length 10, use 1 S10' 

unlcode_ 

U 

Fixed-length Unicode type (number of bytes platform specific); same 
specification semantics as strlng (e.g., ' LI10') 
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You can explicitly convert or cast an array from one dtype to another using ndarray s 
astype method: 

In [37]: arr = np.array([l, 2, 3, 4, 5]) 

In [38]: arr.dtype 
0ut[38]: dtype( 1 int64' ) 

In [39]: float_arr = arr.astype(np.float64) 

In [40]: float_arr.dtype 
Out[40]: dtype( 1 float64' ) 

In this example, integers were cast to floating point. If I cast some floating-point 
numbers to be of integer dtype, the decimal part will be truncated: 

In [41]: arr = np.array( [3 . 7, -1.2, -2.6, 0.5, 12.9, 10.1]) 

In [42]: arr 

0ut[42] : array( [ 3.7, -1.2, -2.6, 0.5, 12.9, 10.1]) 

In [43]: arr.astype(np.int32) 

0ut[43]: array([ 3, -1, -2, 0, 12, 10], dtype=int32) 

If you have an array of strings representing numbers, you can use astype to convert 
them to numeric form: 

In [44]: numertc_strings = np.array( [ 1 1.25 1 , ’-9.6', '42'], dtype=np.strlng_) 

In [45]: numeri.c_stri.ngs. astype(float) 

0ut[45] : array( [ 1.25, -9.6 , 42. ]) 



It’s important to be cautious when using the numpy.string_ type, 
as string data in NumPy is fixed size and may truncate input 
without warning, pandas has more intuitive out-of-the-box behav¬ 
ior on non-numeric data. 


If casting were to fail for some reason (like a string that cannot be converted to 
float64), a ValueError will be raised. Here I was a bit lazy and wrote float instead 
of np. f loat64; NumPy aliases the Python types to its own equivalent data dtypes. 

You can also use another array’s dtype attribute: 

In [46]: int_array = np.arange(10) 

In [47]: calibers = np. array( [. 22, .270, .357, .380, .44, .50], dtype=np.float64) 
In [48]: int_array.astype(calibers.dtype) 

0ut[48] : array( [ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9.]) 


92 | Chapter 4: NumPy Basics: Arrays and Vectorized Computation 





There are shorthand type code strings you can also use to refer to a dtype: 
In [49]: empty_uint32 = np.empty(8, dtype='u4') 

In [50]: empty_uint32 
Out[50] : 

array( [ 0, 1075314688, 0, 1075707904, 0, 

1075838976, 0, 1072693248], dtype=utnt32) 



Calling astype always creates a new array (a copy of the data), even 
if the new dtype is the same as the old dtype. 


Arithmetic with NumPy Arrays 

Arrays are important because they enable you to express batch operations on data 
without writing any for loops. NumPy users call this vectorization. Any arithmetic 
operations between equal-size arrays applies the operation element-wise: 


In [51]: 

arr = np.array([[l 

In [52]: 

arr 

0ut[52] : 


array([[ 

1., 2., 3.], 

[ 

4., 5., 6.]]) 

In [53]: 

arr * arr 

0ut[53] : 


array( [[ 

1., 4., 9.], 

[ 

16., 25., 36.]]) 

In [54]: 

arr - arr 

0ut[54] : 


array([[ 

0., 0., 0.], 

[ 

0., 0., 0.]]) 


Arithmetic operations with scalars propagate the scalar argument to each element in 
the array: 


In [55]: 
0ut[55] : 
array([[ 

1 / arr 

1 . 

0.5 , 

0.3333], 

[ 

0.25 , 

0.2 , 

0.1667]]) 

In [56]: 

arr ** 0 

.5 


0ut[56] : 
array( [ [ 

1 . 

1.4142, 

1.7321], 

[ 

2. 

2.2361, 

2.4495]]) 


Comparisons between arrays of the same size yield boolean arrays: 
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In [57]: arr2 = np.array([ [0. , 4., 1.], [7., 2., 12.]]) 


In [58]: arr2 
Out[58] : 

array( [[ 0., 4., 1.], 

[ 7., 2., 12.]]) 

In [59]: arr2 > arr 
Out[59] : 

array([[False, True, False], 

[ True, False, True]], dtype=bool) 

Operations between differently sized arrays is called broadcasting and will be dis¬ 
cussed in more detail in Appendix A. Having a deep understanding of broadcasting is 
not necessary for most of this book. 

Basic Indexing and Slicing 

NumPy array indexing is a rich topic, as there are many ways you may want to select 
a subset of your data or individual elements. One-dimensional arrays are simple; on 
the surface they act similarly to Python lists: 

In [60]: arr = np.arange(lO) 

In [61]: arr 

0ut[61] : array( [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) 

In [62]: arr[5] 

0ut[62] : 5 

In [63]: arr[5:8] 

0ut[63]: array([5, 6, 7]) 

In [64]: arr[5:8] = 12 

In [65]: arr 

0ut[65] : array( [ 0, 1, 2, 3, 4, 12, 12, 12, 8, 9]) 

As you can see, if you assign a scalar value to a slice, asinarr[5:8] = 12, the value is 
propagated (or broadcasted henceforth) to the entire selection. An important first dis¬ 
tinction from Pythons built-in lists is that array slices are views on the original array. 
This means that the data is not copied, and any modifications to the view will be 
reflected in the source array. 

To give an example of this, I first create a slice of arr: 

In [66]: arr_slice = arr[5:8] 

In [67]: arr_slice 

0ut[67] : array([12, 12, 12]) 
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Now, when I change values in arr_sli.ce, the mutations are reflected in the original 
array arr: 

In [68]: arr_slice[l] = 12345 
In [69]: arr 

0ut[69] : array( [ 0, 1, 2, 3, 4, 12, 12345, 12, 8, 

9]) 

The “bare” slice [: ] will assign to all values in an array: 

In [70]: arr_slice[:] = 64 
In [71]: arr 

0ut[71] : array([ 0, 1, 2, 3, 4, 64, 64, 64, 8, 9]) 

If you are new to NumPy, you might be surprised by this, especially if you have used 
other array programming languages that copy data more eagerly. As NumPy has been 
designed to be able to work with very large arrays, you could imagine performance 
and memory problems if NumPy insisted on always copying data. 



If you want a copy of a slice of an ndarray instead of a view, you 
will need to explicitly copy the array—for example, 
arr[5:8] .copy(). 


With higher dimensional arrays, you have many more options. In a two-dimensional 
array, the elements at each index are no longer scalars but rather one-dimensional 
arrays: 

In [72]: arr2d = np.array([[l, 2, 3], [4, 5, 6], [7, 8, 9]]) 

In [73]: arr2d[2] 

0ut[73]: array([7, 8, 9]) 

Thus, individual elements can be accessed recursively. But that is a bit too much 
work, so you can pass a comma-separated list of indices to select individual elements. 
So these are equivalent: 

In [74]: arr2d[0][2] 

0ut[74] : 3 

In [75]: arr2d[0, 2] 

0ut[75] : 3 

See Figure 4-1 for an illustration of indexing on a two-dimensional array. I find it 
helpful to think of axis 0 as the “rows” of the array and axis 1 as the “columns.” 
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In multidimensional arrays, if you omit later indices, the returned object will be a 
lower dimensional ndarray consisting of all the data along the higher dimensions. So 
in the 2x2x3 array arr3d: 

In [76]: arr3d = np.array([[[l, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]]) 


In [77]: arr3d 
0ut[77] : 


array( [[[ 1, 

2, 

3], 

[ 4, 

5, 

6]], 

[[ 7, 

8, 

9], 

[10, 

11, 

12]]]) 


arr3d[0] is a 2 x 3 array: 

In [78]: arr3d[0] 

0ut[78] : 

array([[l, 2, 3], 

[4, 5, 6]]) 

Both scalar values and arrays can be assigned to a rr3d [0]: 

In [79]: old_values = arr3d[0],copy() 

In [80]: arr3d[0] = 42 

In [81]: arr3d 
0ut[81] : 

array( [[ [42, 42, 42], 

[42, 42, 42]], 

[[ 7, 8, 9], 

[ 10 , 11 , 12 ]]]) 

In [82]: arr3d[0] = old_values 
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In [83]: arr3d 
Out[83] : 


array([[[ 1, 

2, 

3], 

[ 4, 

5, 

6]], 

[[ 7, 

8, 

9], 

[10, 

11, 

12]]]) 


Similarly, arr3d[l, 0] gives you all of the values whose indices start with (1, 0), 
forming a 1-dimensional array: 

In [84]: arr3d[l, 0] 

0ut[84]: array([7, 8, 9]) 

This expression is the same as though we had indexed in two steps: 

In [85]: x = arr3d[l] 

In [86]: x 
0ut[86] : 

array( [ [ 7, 8, 9], 

[ 10 , 11 , 12 ]]) 

In [87]: x[0] 

0ut[87]: array([7, 8, 9]) 

Note that in all of these cases where subsections of the array have been selected, the 
returned arrays are views. 

Indexing with slices 

Like one-dimensional objects such as Python lists, ndarrays can be sliced with the 
familiar syntax: 


In [88]: arr 




0ut[88]: array([ 0, 

1 , 

2, 

3, 4, 

In [89]: arr[l:6] 




0ut[89] : array( [ 1, 

2, 

3, 

4, 64]) 


Consider the two-dimensional array from before, arr2d. Slicing this array is a bit 
different: 

In [90]: arr2d 
Out[90] : 

array([[l, 2, 3], 

[4, 5, 6], 

[7, 8, 9]]) 

In [91]: arr2d[:2] 

0ut[91] : 

array([[l, 2, 3], 

[4, 5, 6]]) 
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As you can see, it has sliced along axis 0, the first axis. A slice, therefore, selects a 
range of elements along an axis. It can be helpful to read the expression arr2d[ :2] as 
“select the first two rows of arr2d.” 

You can pass multiple slices just like you can pass multiple indexes: 

In [92]: arr2d[:2, 1:] 

0ut[92] : 
array([[2, 3], 

[5, 6]]) 

When slicing like this, you always obtain array views of the same number of dimen¬ 
sions. By mixing integer indexes and slices, you get lower dimensional slices. 

For example, I can select the second row but only the first two columns like so: 

In [93]: arr2d[l, :2] 

0ut[93]: array([4, 5]) 

Similarly, I can select the third column but only the first two rows like so: 

In [94]: arr2d[:2, 2] 

0ut[94]: array([3, 6]) 

See Figure 4-2 for an illustration. Note that a colon by itself means to take the entire 
axis, so you can slice only higher dimensional axes by doing: 

In [95]: arr2d[:, :1] 

0ut[95] : 
array( [[1], 

[4], 

[7]]) 

Of course, assigning to a slice expression assigns to the whole selection: 

In [96]: arr2d[:2, 1:] = 0 

In [97]: arr2d 
0ut[97] : 

array([[l, 0, 0], 

[4, 0, 0], 

[7, 8, 9]]) 


98 | Chapter 4: NumPy Basics: Arrays and Vectorized Computation 





Expression Shape 





arr[:2, 1:] (2, 2) 














arr[2] (3,) 
arr[2, :] (3,) 
arr[2:, :] (l, 3) 














arr[:, :2] (3, 2) 














arr[l, :2] (2,) 

arr[l:2, :2] (l, 2) 











Figure 4-2. Two-dimensional array slicing 


Boolean Indexing 

Let’s consider an example where we have some data in an array and an array of names 
with duplicates. I’m going to use here the randn function in nunpy. random to generate 
some random normally distributed data: 

In [98]: names = np.array([ 'Bob' , 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe']) 

In [99]: data = np.random.randn(7, 4) 

In [100]: names 
Out[100] : 

array( [' Bob' , 'Joe', 'Will', 'Bob', ’Will', 'Joe', 'Joe'], 
dtype='<U4' ) 

In [101]: data 
Out[101] : 

array( [[ 0.0929, 0.2817, 0.769 , 1.2464], 

[ 1.0072, -1.2962, 0.275 , 0.2289], 

[ 1.3529, 0.8864, -2.0016, -0.3718], 

[ 1.669 , -0.4386, -0.5397, 0.477 ], 

[ 3.2489, -1.0212, -0.5771, 0.1241], 
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[ 0.3026, 0.5238, 0.0009, 1.3438], 

[-0.7135, -0.8312, -2.3702, -1.8608]]) 

Suppose each name corresponds to a row in the data array and we wanted to select 
all the rows with corresponding name ' Bob '. Like arithmetic operations, compari¬ 
sons (such as ==) with arrays are also vectorized. Thus, comparing names with the 
string ' Bob ' yields a boolean array: 

In [102]: names == 'Bob' 

Out[102]: array([ True, False, False, True, False, False, False], dtype=bool) 

This boolean array can be passed when indexing the array: 

In [103]: data[names == 'Bob'] 

Out[103]: 

array([[ 0.0929, 0.2817, 0.769 , 1.2464], 

[ 1.669 , -0.4386, -0.5397, 0.477 ]]) 

The boolean array must be of the same length as the array axis it’s indexing. You can 
even mix and match boolean arrays with slices or integers (or sequences of integers; 
more on this later). 



Boolean selection will not fail if the boolean array is not the correct 
length, so I recommend care when using this feature. 


In these examples, I select from the rows where names == 'Bob' and index the col¬ 
umns, too: 

In [104]: data[names == 'Bob', 2:] 

Out[104] : 

array([[ 0.769 , 1.2464], 

[-0.5397, 0.477 ]]) 


In [105]: data[names == 'Bob', 3] 

Out[105]: array( [ 1.2464, 0.477 ]) 

To select everything but ' Bob ', you can either use ! = or negate the condition using ~: 

In [106]: names != 'Bob' 

Out[106]: array( [False, True, True, False, True, True, True], dtype=bool) 

In [107]: data[~(names == 'Bob')] 

Out[107]: 

array([[ 1.0072, -1.2962, 0.275 , 0.2289], 

[ 1.3529, 0.8864, -2.0016, -0.3718], 

[ 3.2489, -1.0212, -0.5771, 0.1241], 

[ 0.3026, 0.5238, 0.0009, 1.3438], 

[-0.7135, -0.8312, -2.3702, -1.8608]]) 
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The ~ operator can be useful when you want to invert a general condition: 

In [108]: cond = names == 'Bob' 

In [109]: data[~cond] 

Out[109]: 

array( [[ 1.0072, -1.2962, 0.275 , 0.2289], 

[ 1.3529, 0.8864, -2.0016, -0.3718], 

[ 3.2489, -1.0212, -0.5771, 0.1241], 

[ 0.3026, 0.5238, 0.0009, 1.3438], 

[-0.7135, -0.8312, -2.3702, -1.8608]]) 

Selecting two of the three names to combine multiple boolean conditions, use 
boolean arithmetic operators like & (and) and | (or): 

In [110]: mask = (names == 'Bob') | (names == 'Will') 

In [111]: mask 

Out[lll]: array([ True, False, True, True, True, False, False], dtype=bool) 

In [112]: data[mask] 

0ut[112]: 

array( [[ 0.0929, 0.2817, 0.769 , 1.2464], 

[ 1.3529, 0.8864, -2.0016, -0.3718], 

[ 1.669 , -0.4386, -0.5397, 0.477 ], 

[ 3.2489, -1.0212, -0.5771, 0.1241]]) 

Selecting data from an array by boolean indexing always creates a copy of the data, 
even if the returned array is unchanged. 



The Python keywords and and or do not work with boolean arrays. 
Use & (and) and | (or) instead. 


Setting values with boolean arrays works in a common-sense way. To set all of the 
negative values in data to 0 we need only do: 

In [113]: data[data < 0] = 0 


In [114]: data 
0ut[114] : 


array( [[ 

0.0929, 

0.2817, 

0.769 , 

1.2464], 

[ 

1.0072, 

0. 

0.275 , 

0.2289], 

[ 

1.3529, 

0.8864, 

0. 

0. 

1, 

[ 

1.669 , 

0. 

0. 

0.477 ], 

[ 

3.2489, 

0. 

0. 

0.1241], 

[ 

0.3026, 

0.5238, 

0.0009, 

1.3438], 

[ 

0. 

0. 

0. 

0. 

11) 
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Setting whole rows or columns using a one-dimensional boolean array is also easy: 


In [115]: 

In [116]: 
0ut[116]: 
array([[ 

: data[names != ' 

: data 

7. ,7. 

Joe 1 ] = 7 

7. 

7. 

]. 

[ 

1.0072, 

0. 

0.275 , 

0.2289], 

[ 

7. 

7. 

7. 

7. 

1. 

[ 

7. 

7. 

7. 

7. 

1. 

[ 

7. 

7. 

7. 

7. 

1, 

[ 

0.3026, 

0.5238, 

0.0009, 

1.3438], 

[ 

0. 

0. 

0. 

0. 

11) 


As we will see later, these types of operations on two-dimensional data are convenient 
to do with pandas. 


Fancy Indexing 

Fancy indexing is a term adopted by NumPy to describe indexing using integer arrays. 
Suppose we had an 8 x 4 array: 

In [117]: arr = np.empty((8, 4)) 


In [118]: for 1 in range(8): 


.: arr[ 

In [119]: arr 

1] 

= i 

0ut[119] : 

array ([[ 0. 

, 0., 

0. 

, o.]. 

[ 1. 

, 1., 

1. 

, i.]. 

[ 2. 

, 2., 

2. 

. 2.], 

[ 3. 

, 3., 

3. 

, 3.], 

[ 4. 

, 4., 

4. 

, 4.], 

[ 5. 

, 5., 

5. 

, 5.], 

[ 6. 

, 6., 

6. 

> 6. ] , 

[ 7. 

, 7., 

7. 

, 7.]]) 

To select out a 

subset of the rows 

ndarray of integers specifying the 

In [120]: arr[[4. 

3, 

0, 6]] 

Out[120] : 

array( [[ 4. 

, 4., 

4. 

, 4.], 

[ 3. 

, 3., 

3. 

, 3.], 

[ 0- 

, 0., 

0. 

, 0-L 

[ 6. 

, 6., 

6. 

.. 6-]]) 


Hopefully this code did what you expected! Using negative indices selects rows from 
the end: 


In [121]: arr[[-3, -5, -7]] 
0ut[121]: 
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array([[ 5., 

5., 

5., 5.], 

[ 3., 

3., 

3., 3.], 

[ 1., 

1., 

1.. 1.]]) 


Passing multiple index arrays does something slightly different; it selects a one¬ 
dimensional array of elements corresponding to each tuple of indices: 

In [122]: arr = np.arange(32).reshape((8, 4)) 

In [123]: arr 

0ut[123] : 

array( [[ 0, 1, 2, 3], 

[ 4, 5, 6, 7], 

[ 8, 9, 10, 11], 

[12, 13, 14, 15], 

[16, 17, 18, 19], 

[20, 21, 22, 23], 

[24, 25, 26, 27], 

[28, 29, 30, 31]]) 

In [124]: arr[[l, 5, 7, 2], [0, 3, 1, 2]] 

0ut[124] : array( [ 4, 23, 29, 10]) 

We’ll look at the reshape method in more detail in Appendix A. 

Here the elements (1, 0), (5, 3), (7, l),and(2, 2) were selected. Regardless of 
how many dimensions the array has (here, only 2), the result of fancy indexing is 
always one-dimensional. 

The behavior of fancy indexing in this case is a bit different from what some users 
might have expected (myself included), which is the rectangular region formed by 
selecting a subset of the matrix’s rows and columns. Here is one way to get that: 

In [125]: arr[[l, 5, 7, 2]][:, [0, 3, 1, 2]] 

0ut[125] : 

array( [ [ 4, 7, 5, 6], 

[20, 23, 21, 22], 

[28, 31, 29, 30], 

[ 8, 11, 9, 10]]) 

Keep in mind that fancy indexing, unlike slicing, always copies the data into a new 
array. 

Transposing Arrays and Swapping Axes 

Transposing is a special form of reshaping that similarly returns a view on the under¬ 
lying data without copying anything. Arrays have the transpose method and also the 
special T attribute: 

In [126]: arr = np.arange(15).reshape((3, 5)) 

In [127]: arr 
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0ut[127]: 

array([[ 0, 1, 2, 3, 4], 

[ 5, 6, 7, 8, 9], 

[10, 11, 12, 13, 14]]) 

In [128]: arr.T 

Out[128]: 

array( [[ 0, 5, 10], 

[ 1 , 6 , 11 ], 

[ 2, 7, 12], 

[ 3, 8, 13], 

[ 4, 9, 14]]) 

When doing matrix computations, you may do this very often—for example, when 
computing the inner matrix product using np. dot: 

In [129]: arr = np.random.randn(6, 3) 

In [130]: arr 

Out[130]: 

array( [[ -0.8608, 0.5601, -1.2659], 

[ 0.1198, -1.0635, 0.3329], 

[-2.3594, -0.1995, -1.542 ], 

[-0.9707, -1.307 , 0.2863], 

[ 0.378 , -0.7539, 0.3313], 

[ 1.3497, 0.0699, 0.2467]]) 

In [131]: np.dot(arr .T, arr) 

0ut[131]: 

array([[ 9.2291, 0.9394, 4.948 ], 

[ 0.9394, 3.7662, -1.3622], 

[ 4.948 , -1.3622, 4.3437]]) 

For higher dimensional arrays, transpose will accept a tuple of axis numbers to per¬ 
mute the axes (for extra mind bending): 

In [132]: arr = np.arange(16).reshape((2, 2, 4)) 

In [133]: arr 

0ut[133]: 

array([[[ 0, 1, 2, 3], 

[ 4, 5, 6, 7]], 

[[ 8, 9, 10, 11], 

[12, 13, 14, 15]]]) 

In [134]: arr.transpose( (1, 0, 2)) 

0ut[134]: 

array([[[ 0, 1, 2, 3], 

[ 8, 9, 10, 11]], 

[[ 4, 5, 6, 7], 

[12, 13, 14, 15]]]) 
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Here, the axes have been reordered with the second axis first, the first axis second, 
and the last axis unchanged. 

Simple transposing with . T is a special case of swapping axes, ndarray has the method 
swapaxes, which takes a pair of axis numbers and switches the indicated axes to rear¬ 
range the data: 


In [135]: arr 
0ut[135]: 

array([[[ 0, 1, 2, 3], 

[ 4, 5, 6, 7]], 

[[ 8, 9, 10, 11], 

[12, 13, 14, 15]]]) 


In [136]: arr.swapaxes(l, 2) 

0ut[136]: 

array([[[ 0, 4], 

[ 1, 5], 

[ 2 , 6 ], 

[ 3, 7]], 

[[ 8 , 12 ], 

[ 9, 13], 

[10, 14], 

[11, 15]]]) 

swapaxes similarly returns a view on the data without making a copy. 

4.2 Universal Functions: Fast Element-Wise Array 
Functions 

A universal function, or ufunc, is a function that performs element-wise operations 
on data in ndarrays. You can think of them as fast vectorized wrappers for simple 
functions that take one or more scalar values and produce one or more scalar results. 

Many ufuncs are simple element-wise transformations, like sqrt or exp: 

In [137]: arr = np.arange(10) 

In [138]: arr 

0ut[138]: array( [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) 

In [139]: np.sqrt(arr) 

0ut[139]: 

array( [ 0. ,1. , 1.4142, 1.7321, 2. , 2.2361, 2.4495, 

2.6458, 2.8284, 3. ]) 

In [140]: np.exp(arr) 

Out[140]: 

array([ 1. , 2.7183, 7.3891, 20.0855, 54.5982, 

148.4132, 403.4288, 1096.6332, 2980.958 , 8103.0839]) 
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These are referred to as unary ufuncs. Others, such as add or maximum, take two arrays 
(thus, binary ufuncs) and return a single array as the result: 

In [141]: x = np.random.randn(8) 

In [142]: y = np.random.randn(8) 

In [143]: x 
0ut[143]: 

array([-0.0119, 1.0048, 1.3272, -0.9193, -1.5491, 0.0222, 0.7584, 

-0.6605]) 

In [144]: y 
0ut[144]: 

array([ 0.8626, -0.01 , 0.05 , 0.6702, 0.853 , -0.9559, -0.0235, 

-2.3042]) 

In [145]: np.maximum(x, y) 

0ut[145]: 

array( [ 0.8626, 1.0048, 1.3272, 0.6702, 0.853 , 0.0222, 0.7584, 

-0.6605]) 

Here, numpy .maximum computed the element-wise maximum of the elements in x and 

y- 

While not common, a ufunc can return multiple arrays, modf is one example, a vec¬ 
torized version of the built-in Python divmod; it returns the fractional and integral 
parts of a floating-point array: 

In [146]: arr = np.random.randn(7) * 5 
In [147]: arr 

0ut[147]: array( [-3.2623, -6.0915, -6.663 , 5.3731, 3.6182, 3.45 , 5.0077]) 

In [148]: remainder, whote_part = np.modf(arr) 

In [149]: remainder 

0ut[149]: array( [ -0.2623, -0.0915, -0.663 , 0.3731, 0.6182, 0.45 , 0.0077]) 

In [150]: whole_part 

Out[150]: array( [ -3. , -6., -6., 5., 3., 3., 5.]) 

Ufuncs accept an optional out argument that allows them to operate in-place on 
arrays: 

In [151]: arr 

0ut[151]: array( [-3.2623, -6.0915, -6.663 , 5.3731, 3.6182, 3.45 , 5.0077]) 

In [152]: np.sqrt(arr) 

0ut[152]: array( [ nan, nan, nan, 2.318 , 1.9022, 1.8574, 2.2378]) 

In [153]: np.sqrt(arr, arr) 
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Out[153] : array( [ nan, nan, nan, 2.318 , 1.9022, 1.8574, 2.2378]) 

In [154]: arr 

0ut[154] : array( [ nan, nan, nan, 2.318 , 1.9022, 1.8574, 2.2378]) 

See Tables 4-3 and 4-4 for a listing of available ufuncs. 


Table 4-3. Unary ufuncs 


Function 

Description 

abs, tabs 

Compute the absolute value element-wise for integer, floating-point, or complex values 

sqrt 

Compute the square root of each element (equivalent to arr ** 0.5) 

square 

Compute the square of each element (equivalent to arr ** 2) 

exp 

Compute the exponent e x of each element 

log, logl0, 
log2, loglp 

Natural logarithm (base e), log base 10, log base 2, and log(1 + x), respectively 

sign 

Compute the sign of each element: 1 (positive), 0 (zero), or -1 (negative) 

cell 

Compute the ceiling of each element (i.e., the smallest integer greater than or equal to that 
number) 

floor 

Compute the floor of each element (i.e., the largest integer less than or equal to each element) 

rent 

Round elements to the nearest integer, preserving the dtype 

modf 

Return fractional and integral parts of array as a separate array 

isnan 

Return boolean array indicating whether each value is NaN (Not a Number) 

isfinlte, isinf 

Return boolean array indicating whether each element is finite (non-inf, non-NaN) or infinite, 
respectively 

cos, cosh, sin, 
sinh, tan, tanh 

Regular and hyperbolic trigonometric functions 

arccos, arccosh, 
arcsin, arcsinh, 
arctan, arctanh 

Inverse trigonometric functions 

logical_not 

Compute truth value of not x element-wise (equivalent to -arr). 

Table 4-4. Binary universal functions 

Function 

Description 

add 

Add corresponding elements in arrays 

subtract 

Subtract elements in second array from first array 

multiply 

Multiply array elements 

divide, floor_divide Divide or floor divide (truncating the remainder) 

power 

Raise elements in first array to powers indicated in second array 

maximum, fmax 

Element-wise maximum; fmax ignores NaN 

minimum, fmin 

Element-wise minimum; fnln ignores NaN 

mod 

Element-wise modulus (remainder of division) 

copysign 

Copy sign of values in second argument to values in first argument 
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Function 


Description 


greater, greater_equal, 
less, less_equal, 
equal, not_equal 
loglcal_and, 
logical_or, loglcal_xor 


Perform element-wise comparison, yielding boolean array (equivalent to infix 
operators >, >=, <, <=, ==, !=) 

Compute element-wise truth value of logical operation (equivalent to infix operators 

& I, A ) 


4.3 Array-Oriented Programming with Arrays 

Using NumPy arrays enables you to express many kinds of data processing tasks as 
concise array expressions that might otherwise require writing loops. This practice of 
replacing explicit loops with array expressions is commonly referred to as vectoriza- 
tion. In general, vectorized array operations will often be one or two (or more) orders 
of magnitude faster than their pure Python equivalents, with the biggest impact in 
any kind of numerical computations. Later, in Appendix A, I explain broadcasting, a 
powerful method for vectorizing computations. 

As a simple example, suppose we wished to evaluate the function sqrt(x A 2 + y A 2) 
across a regular grid of values. The np. neshgrtd function takes two ID arrays and 
produces two 2D matrices corresponding to all pairs of (x, y) in the two arrays: 

In [155]: points = np.arange(-5, 5, 0.01) # 1000 equally spaced points 
In [156]: xs, ys = np.meshgrld(points, points) 


In [157]: ys 
0ut[157]: 


array( [[ -5. , 

-5. , 

-5. , .. 

., -5. , 

-5. , 

-5. ], 

[-4.99, 

-4.99, 

-4.99, .. 

., -4.99, 

-4.99, 

-4.99], 

[-4.98, 

-4.98, 

-4.98, .. 

., -4.98, 

-4.98, 

-4.98], 

• • • f 

[ 4.97, 

4.97, 

4.97, .. 

., 4.97, 

4.97, 

4.97], 

[ 4.98, 

4.98, 

4.98, .. 

., 4.98, 

4.98, 

4.98], 

[ 4.99, 

4.99, 

4.99, .. 

., 4.99, 

4.99, 

4.99]]) 


Now, evaluating the function is a matter of writing the same expression you would 
write with two points: 

In [158]: z = np.sqrt(xs ** 2 + ys ** 2) 


In [159]: z 
0ut[159]: 


array( [ [ 

7.0711, 

7.064 , 

7.0569, .. 

., 7.0499, 

7.0569, 

7.064 ], 

[ 

7.064 , 

7.0569, 

7.0499, .. 

., 7.0428, 

7.0499, 

7.0569], 

[ 

7.0569, 

7.0499, 

7.0428, .. 

., 7.0357, 

7.0428, 

7.0499], 

[ 

■ • » 

7.0499, 

7.0428, 

7.0357, .. 

., 7.0286, 

7.0357, 

7.0428], 

[ 

7.0569, 

7.0499, 

7.0428, .. 

., 7.0357, 

7.0428, 

7.0499], 

[ 

7.064 , 

7.0569, 

7.0499, .. 

., 7.0428, 

7.0499, 

7.0569]]) 
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As a preview of Chapter 9, I use matplotlib to create visualizations of this two- 
dimensional array: 

In [160]: import as 

In [161]: plt.imshow(z, cmap=plt.cm.gray); pit.colorbar( ) 

0ut[161]: <matplotlib.colorbar.Colorbar at 0x7f715e3fa630> 

In [162]: pit.title(" Image plot of $\sqrt{x A 2 + y A 2}$ for a grid of values") 
0ut[162]: <matplotlib.text.Text at 0x7f715d2de748> 

See Figure 4-3. Here I used the matplotlib function inshow to create an image plot 
from a two-dimensional array of function values. 



Figure 4-3. Plot of function evaluated on grid 


Expressing Conditional Logic as Array Operations 

The nupipy. where function is a vectorized version of the ternary expression x if con 
dition else y. Suppose we had a boolean array and two arrays of values: 
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In [165]: xarr = np.array([l.l, 1.2, 1.3, 1.4, 1.5]) 


In [166]: yarn = np.array([2.1, 2.2, 2.3, 2.4, 2.5]) 

In [167]: cond = np.array([True, False, True, True, False]) 

Suppose we wanted to take a value from xarr whenever the corresponding value in 
cond is True, and otherwise take the value from yarr. A list comprehension doing 
this might look like: 

In [168]: result = [(x if c else y) 

.: for x, y, c in zip(xarr, yarr, cond)] 


In [169]: result 

0ut[169]: [1.1000000000000001, 2.2000000000000002, 1.3, 1.3999999999999999, 2.5] 

This has multiple problems. First, it will not be very fast for large arrays (because all 
the work is being done in interpreted Python code). Second, it will not work with 
multidimensional arrays. With np.where you can write this very concisely: 

In [170]: result = np.where(cond, xarr, yarr) 

In [171]: result 

0ut[171]: array( [ 1.1, 2.2, 1.3, 1.4, 2.5]) 

The second and third arguments to np.where don’t need to be arrays; one or both of 
them can be scalars. A typical use of where in data analysis is to produce a new array 
of values based on another array. Suppose you had a matrix of randomly generated 
data and you wanted to replace all positive values with 2 and all negative values with 
-2. This is very easy to do with np.where: 

In [172]: arr = np.random.randn(4, 4) 

In [173]: arr 

0ut[173]: 

array( [ [-0.5031, -0.6223, -0.9212, -0.7262], 

[ 0.2229, 0.0513, -1.1577, 0.8167], 

[ 0.4336, 1.0107, 1.8249, -0.9975], 

[ 0.8506, -0.1316, 0.9124, 0.1882]]) 

In [174] : arr > 0 

0ut[174]: 

array([[False, False, False, False], 

[ True, True, False, True], 

[ True, True, True, False], 

[ True, False, True, True]], dtype=bool) 

In [175]: np.where(arr > 0, 2, -2) 

0ut[175]: 

array( [[ -2, -2, -2, -2], 

[ 2 , 2 , - 2 , 2 ], 
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[ 2 , 2 , 2 , - 2 ], 

[ 2 , - 2 , 2 , 2 ]]) 

You can combine scalars and arrays when using np.where. For example, I can replace 
all positive values in arr with the constant 2 like so: 

In [176]: np.where(arr > 0, 2, arr) # set only positive values to 2 
0ut[176]: 

array( [ [-0.5031, -0.6223, -0.9212, -0.7262], 

[2. ,2. , -1.1577, 2. ], 

[2. ,2. ,2. , -0.9975], 

[ 2. , -0.1316, 2. , 2. ]]) 

The arrays passed to np. where can be more than just equal-sized arrays or scalars. 


Mathematical and Statistical Methods 

A set of mathematical functions that compute statistics about an entire array or about 
the data along an axis are accessible as methods of the array class. You can use aggre¬ 
gations (often called reductions) like sum, mean, and std (standard deviation) either by 
calling the array instance method or using the top-level NumPy function. 

Here I generate some normally distributed random data and compute some aggregate 
statistics: 

In [177]: arr = np.random.randn(5, 4) 


In [178]: arr 
0ut[178]: 

array( [[ 2.1695, -0.1149, 
[ 0.7953, 0.1181, 

[ 0.1527, -1.5657, 
[-0.929 , -0.4826, 
[ 0.9809, -0.5895, 


2.0037, 0.0296], 

-0.7485, 0.585 ], 

-0.5625, -0.0327], 
-0.0363, 1.0954], 

1.5817, -0.5287]]) 


In [179]: arr.meanQ 
0ut[179]: 0.19607051119998253 


In [180]: np.mean(arr) 
Out[180] : 0.19607051119998253 


In [181]: arr.sumQ 
0ut[181]: 3.9214102239996507 

Functions like mean and sum take an optional axis argument that computes the statis¬ 
tic over the given axis, resulting in an array with one fewer dimension: 

In [182]: arr.mean(axts=l) 

0ut[182]: array( [ 1.022 , 0.1875, -0.502 , -0.0881, 0.3611]) 


In [183]: arr. sum(axis=0) 

0ut[183]: array( [ 3.1693, -2.6345, 2.2381, 1.1486]) 
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Here, arr.mean(l) means “compute mean across the columns” where arr.suni(0) 
means “compute sum down the rows.” 

Other methods like cumsum and cumprod do not aggregate, instead producing an array 
of the intermediate results: 

In [184]: arr = np.array([0, 1, 2, 3, 4, 5, 6, 7]) 

In [185]: arr .cumsum( ) 

0ut[185]: array( [ 0, 1, 3, 6, 10, 15, 21, 28]) 

In multidimensional arrays, accumulation functions like cun sum return an array of 
the same size, but with the partial aggregates computed along the indicated axis 
according to each lower dimensional slice: 

In [186]: arr = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]]) 

In [187]: arr 
0ut[187]: 
array( [ [0, 1, 2], 

[3, 4, 5], 

[6, 7, 8]]) 

In [188]: arr .cunsum(axis=0) 

0ut[188]: 

array( [[ 0, 1, 2], 

[ 3, 5, 7], 

[ 9, 12, 15]]) 

In [189]: arr .cunprod(axts=l) 

0ut[189]: 

array( [[ 0, 0, 0], 

[ 3, 12, 60], 

[ 6, 42, 336]]) 

See Table 4-5 for a full listing. We’ll see many examples of these methods in action in 
later chapters. 


Table 4-5. Basic array statistical methods 


1 Method 

Description | 

sum 

Sum of all the elements in the array or along an axis; zero-length arrays have sum 0 

mean 

Arithmetic mean; zero-length arrays have NaN mean 

std, var 

Standard deviation and variance, respectively, with optional degrees of freedom adjustment (default 
denominator n) 

min, max 

Minimum and maximum 

argmin, argmax 

Indices of minimum and maximum elements, respectively 

cumsum 

Cumulative sum of elements starting from 0 

cumprod 

Cumulative product of elements starting from 1 
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Methods for Boolean Arrays 

Boolean values are coerced to 1 (True) and 0 (False) in the preceding methods. Thus, 
sun is often used as a means of counting T rue values in a boolean array: 

In [190]: arr = np.random. randn(100) 

In [191]: (arr > 0).sum() # Number of positive values 
0ut[191] : 42 

There are two additional methods, any and all, useful especially for boolean arrays, 
any tests whether one or more values in an array is True, while all checks if every 
value is True: 

In [192]: bools = np.array([False, False, True, False]) 

In [193]: bools. any() 

0ut[193]: True 

In [194]: bools. all() 

0ut[194]: False 

These methods also work with non-boolean arrays, where non-zero elements evalu¬ 
ate to True. 

Sorting 

Like Python’s built-in list type, NumPy arrays can be sorted in-place with the sort 
method: 

In [195]: arr = np.random. randn(6) 

In [196]: arr 

0ut[196] : array( [ 0.6095, -0.4938, 1.24 , -0.1357, 1.43 , -0.8469]) 

In [197]: arr.sort() 

In [198]: arr 

0ut[198] : array( [ -0.8469, -0.4938, -0.1357, 0.6095, 1.24 , 1.43 ]) 

You can sort each one-dimensional section of values in a multidimensional array in- 
place along an axis by passing the axis number to sort: 

In [199]: arr = np.random. randn(5, 3) 

In [200]: arr 
Out[200] : 

array( [[ 0.6033, 1.2636, -0.2555], 

[-0.4457, 0.4684, -0.9616], 

[-1.8245, 0.6254, 1.0229], 

[ 1.1074, 0.0909, -0.3501], 

[ 0.218 , -0.8948, -1.7415]]) 
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In [201]: arr.sort(l) 


In [202]: arr 
Out[202] : 

array( [ [-0.2555, 0.6033, 1.2636], 

[-0.9616, -0.4457, 0.4684], 

[-1.8245, 0.6254, 1.0229], 

[-0.3501, 0.0909, 1.1074], 

[-1.7415, -0.8948, 0.218 ]]) 

The top-level method np. sort returns a sorted copy of an array instead of modifying 
the array in-place. A quick-and-dirty way to compute the quantiles of an array is to 
sort it and select the value at a particular rank: 

In [203]: large_arr = np.random. randn(1000) 

In [204]: large_arr.sort() 

In [205]: large_arr[int(0.05 * len(targe_arr))] # 5% quantile 
Out[205] : -1.5311513550102103 

For more details on using NumPy s sorting methods, and more advanced techniques 
like indirect sorts, see Appendix A. Several other kinds of data manipulations related 
to sorting (e.g., sorting a table of data by one or more columns) can also be found in 
pandas. 

Unique and Other Set Logic 

NumPy has some basic set operations for one-dimensional ndarrays. A commonly 
used one is np. unique, which returns the sorted unique values in an array: 

In [206]: names = np.array( [ 'Bob' , 'Joe', 'Will', 'Bob’, 'Will', 'Joe', 'Joe']) 

In [207]: np.unique(names) 

Out[207] : 

array( [ 'Bob' , 'Joe', 'Will'], 
dtype='<U4' ) 

In [208]: ints = np.array([3, 3, 3, 2, 2, 1, 1, 4, 4]) 

In [209]: np.unique(ints) 

Out[209]: array([l, 2, 3, 4]) 

Contrast np. unique with the pure Python alternative: 

In [210]: sorted(set(names)) 

Out[210]: [’Bob', 'Joe', 'Will'] 

Another function, np.inld, tests membership of the values in one array in another, 
returning a boolean array: 
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In [211]: values = np.array([6, 0, 0, 3, 2, 5, 6]) 


In [212]: np.inldfvalues, [2, 3, 6]) 

Out[212]: array([ True, False, False, True, True, False, True], dtype=bool) 

See Table 4-6 for a listing of set functions in NumPy. 


Table 4-6. Array set operations 


1 Method 

Description 1 

unlque(x) 

Compute the sorted, unique elements in x 

intersectld(x, y) 

Compute the sorted, common elements in x and y 

unlonld(x, y) 

Compute the sorted union of elements 

inld(x, y) 

Compute a boolean array indicating whether each element of x is contained in y 

setdlffld(x, y) 

Set difference, elements in x that are not in y 

setxorld(x, y) 

Set symmetric differences; elements that are in either of the arrays, but not both 


4.4 File Input and Output with Arrays 

NumPy is able to save and load data to and from disk either in text or binary format. 
In this section I only discuss NumPy’s built-in binary format, since most users will 
prefer pandas and other tools for loading text or tabular data (see Chapter 6 for much 
more). 

np. save and np. load are the two workhorse functions for efficiently saving and load¬ 
ing array data on disk. Arrays are saved by default in an uncompressed raw binary 
format with file extension . npy: 

In [213]: arr = np.arange(10) 

In [214]: np.save( 1 some_array' , arr) 

If the file path does not already end in . npy, the extension will be appended. The array 
on disk can then be loaded with np. load: 

In [215]: np.load( 1 sone_array.npy' ) 

0ut[215] : array( [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) 

You save multiple arrays in an uncompressed archive using np.savez and passing the 
arrays as keyword arguments: 

In [216]: np.savez( 1 array_archive.npz 1 , a=arr, b=arr) 

When loading an .npz file, you get back a dict-like object that loads the individual 
arrays lazily: 

In [217]: arch = np.load( , array_archive.npz' ) 

In [218]: arch [ 'b' ] 

0ut[218] : array( [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) 
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If your data compresses well, you may wish to use nunpy. savez_compressed instead: 
In [219]: np.savez_compressed( 'arrays_compressed.npz 1 , a=arr, b=arr) 


4.5 Linear Algebra 

Linear algebra, like matrix multiplication, decompositions, determinants, and other 
square matrix math, is an important part of any array library. Unlike some languages 
like MATLAB, multiplying two two-dimensional arrays with * is an element-wise 
product instead of a matrix dot product. Thus, there is a function dot, both an array 
method and a function in the nunpy namespace, for matrix multiplication: 

In [223]: x = np.array( [ [1. , 2., 3.], [4., 5., 6.]]) 

In [224]: y = np.array( [ [6. , 23.], [-1, 7], [8, 9]]) 

In [225]: x 
0ut[225] : 

array( [ [ 1., 2., 3.], 

[ 4., 5., 6.]]) 


In [226]: y 
0ut[226] : 

array( [[ 6., 23.], 

[ -1., 7.], 

[ 8., 9.]]) 


In [227]: x.dot(y) 

0ut[227] : 

array( [[ 28., 64.], 

[ 67., 181.]]) 

x.dot(y) is equivalent to np.dot(x, y): 

In [228]: np.dot(x, y) 

0ut[228] : 

array( [[ 28., 64], 

[ 67., 181.]]) 

A matrix product between a two-dimensional array and a suitably sized one¬ 
dimensional array results in a one-dimensional array: 

In [229]: np.dot(x, np.ones(3)) 

0ut[229] : array([ 6., 15.]) 

The @ symbol (as of Python 3.5) also works as an infix operator that performs matrix 
multiplication: 

In [230]: x @ np.ones(3) 

Out[230]: array([ 6., 15.]) 
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numpy.linalg has a standard set of matrix decompositions and things like inverse 
and determinant. These are implemented under the hood via the same industry- 
standard linear algebra libraries used in other languages like MATLAB and R, such as 
BLAS, LAPACK, or possibly (depending on your NumPy build) the proprietary Intel 
MKL (Math Kernel Library): 

In [231]: from inport tnv, qr 

In [232]: X = np.random.randn(5, 5) 

In [233]: mat = X.T.dot(X) 


In [234]: inv(nat) 

0ut[234] : 

array( [[ 933.1189, 871.8258, 

[ 871.8258, 815.3929, 

[-1417.6902, -1325.9965, 
[-1460.4005, -1365.9242, 
[ 1782.1391, 1666.9347, 


-1417.6902, 

-1325.9965, 

2158.4424, 

2222.0191, 

-2711.6822, 


-1460.4005, 1782.1391], 

-1365.9242, 1666.9347], 

2222.0191, -2711.6822], 
2289.0575, -2793.422 ], 
-2793.422 , 3409.5128]]) 


In [235]: mat.dot(inv(mat)) 
0ut[235] : 


array( [ [ 1., 

0., -0., -0. 

, -0.], 

[-0., 

1., 0., 0. 

, 0-], 

[ 0-, 

0., 1., 0. 

, 0-], 

[-0., 

0., 0., 1. 

, -0.], 

[-0., 

0., 0., 0. 

, 1.]]) 

In [236]: q. 

r = qr(mat) 


In [237]: r 



0ut[237] : 



array( [[ -1.6914, 4.38 , 

0.1757, 0.4075, -0.7838], 

[ 0- 

, -2.6436, 

0.1939, -3.072 , -1.0702], 

[ 0- 

, 0. 

-0.8138, 1.5414, 0.6155], 

[ 0- 

, 0. 

0. , -2.6445, -2.1669], 

[ 0- 

, 0. 

0. ,0. , 0.0002]]) 


The expression X . T. dot(X) computes the dot product of X with its transpose X . T. 
See Table 4-7 for a list of some of the most commonly used linear algebra functions. 


Table 4-7. Commonly used numpy.linalg functions 


Function Description 


diag Return the diagonal (or off-diagonal) elements of a square matrix as a 1D array, or convert a 1D array into a 
square matrix with zeros on the off-diagonal 
dot Matrix multiplication 

trace Compute the sum of the diagonal elements 

det Compute the matrix determinant 
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Function Description 


eig Compute the eigenvalues and eigenvectors of a square matrix 

in v Compute the inverse of a square matrix 

pinv Compute the Moore-Penrose pseudo-inverse of a matrix 

q r Compute the QR decomposition 

svd Compute the singular value decomposition (SVD) 

solve Solve the linear system Ax = b for x, where A is a square matrix 

Istsq Compute the least-squares solution to Ax = b 

4.6 Pseudorandom Number Generation 

The numpy.random module supplements the built-in Python random with functions 
for efficiently generating whole arrays of sample values from many kinds of probabil¬ 
ity distributions. For example, you can get a 4 x 4 array of samples from the standard 
normal distribution using normal: 

In [238]: samples = np.random. normal(slze=(4, 4)) 

In [239]: samples 
0ut[239] : 

array( [[ 0.5732, 0.1933, 0.4429, 1.2796], 

[ 0.575 , 0.4339, -0.7658, -1.237 ], 

[-0.5367, 1.8545, -0.92 , -0.1082], 

[ 0.1525, 0.9435, -1.0953, -0.144 ]]) 

Python’s built-in random module, by contrast, only samples one value at a time. As 
you can see from this benchmark, numpy. random is well over an order of magnitude 
faster for generating very large samples: 

In [240]: from Import normalvarlate 

In [241]: N = 1000000 

In [242]: %tlmelt samples = [normalvarlate(0, 1) for _ In range(N)] 

1.77 s +- 126 ms per loop (mean +- std. dev. of 7 runs, 1 loop each) 

In [243]: %tlmelt np.random.normal(slze=N) 

61.7 ms +- 1.32 ms per loop (mean +- std. dev. of 7 runs, 10 loops each) 

We say that these are pseudorandom numbers because they are generated by an algo¬ 
rithm with deterministic behavior based on the seed of the random number genera¬ 
tor. You can change NumPy’s random number generation seed using 
np.random.seed: 

In [244]: np.random. seed(1234) 
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The data generation functions in nunpy. random use a global random seed. To avoid 
global state, you can use numpy.random.RandomState to create a random number 
generator isolated from others: 

In [245]: rng = np.random.RandomState( 1234) 

In [246]: rng.randn(10) 

Out [246] : 

array([ 0.4714, -1.191 , 1.4327, -0.3127, -0.7206, 0.8872, 0.8596, 

-0.6365, 0.0157, -2.2427]) 

See Table 4-8 for a partial list of functions available in numpy. random. I’ll give some 
examples of leveraging these functions’ ability to generate large arrays of samples all 
at once in the next section. 


Table 4-8. Partial list of numpy.random functions 


1 Function 

Description 1 

seed 

Seed the random number generator 

permutation 

Return a random permutation of a sequence, or return a permuted range 

shuffle 

Randomly permute a sequence in-place 

rand 

Draw samples from a uniform distribution 

randint 

Draw random integers from a given low-to-high range 

randn 

Draw samples from a normal distribution with mean 0 and standard deviation 1 (MATLAB-like interface) 

binomial 

Draw samples from a binomial distribution 

normal 

Draw samples from a normal (Gaussian) distribution 

beta 

Draw samples from a beta distribution 

chisquare 

Draw samples from a chi-square distribution 

gamma 

Draw samples from a gamma distribution 

uniform 

Draw samples from a uniform [0,1) distribution 


4.7 Example: Random Walks 

The simulation of random walks provides an illustrative application of utilizing array 
operations. Let’s first consider a simple random walk starting at 0 with steps of 1 and 
-1 occurring with equal probability. 

Here is a pure Python way to implement a single random walk with 1,000 steps using 
the built-in random module: 

In [247]: import 

position = 0 

.: walk = [position] 

.: steps = 1000 

.: for i in range(steps) : 

.: step = 1 if random.randint(0, 1) else -1 

.: position += step 
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walk.append(position) 


See Figure 4-4 for an example plot of the first 100 values on one of these random 
walks: 



Figure 4-4. A simple random walk 

You might make the observation that walk is simply the cumulative sum of the ran¬ 
dom steps and could be evaluated as an array expression. Thus, I use the np. random 
module to draw 1,000 coin flips at once, set these to 1 and -1, and compute the 
cumulative sum: 

In [251]: nsteps = 1000 

In [252]: draws = np.random.randint(0, 2, size=nsteps) 

In [253]: steps = np.where(draws > 0, 1, -1) 

In [254]: walk = steps.cumsum( ) 

From this we can begin to extract statistics like the minimum and maximum value 
along the walk’s trajectory: 

In [255]: walk.min() 

0ut[255] : -3 
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In [256]: walk.maxQ 
Out[256] : 31 

A more complicated statistic is the first crossing time, the step at which the random 
walk reaches a particular value. Here we might want to know how long it took the 
random walk to get at least 10 steps away from the origin 0 in either direction, 
np. abs(walk) >= 10 gives us a boolean array indicating where the walk has reached 
or exceeded 10, but we want the index of the first 10 or -10. Turns out, we can com¬ 
pute this using argnax, which returns the first index of the maximum value in the 
boolean array (True is the maximum value): 

In [257]: (np.abs(walk) >= 10).argnax() 

0ut[257] : 37 

Note that using argnax here is not always efficient because it always makes a full scan 
of the array. In this special case, once a True is observed we know it to be the maxi¬ 
mum value. 

Simulating Many Random Walks at Once 

If your goal was to simulate many random walks, say 5,000 of them, you can generate 
all of the random walks with minor modifications to the preceding code. If passed a 
2-tuple, the nunpy. random functions will generate a two-dimensional array of draws, 
and we can compute the cumulative sum across the rows to compute all 5,000 ran¬ 
dom walks in one shot: 

In [258]: nwalks = 5000 
In [259]: nsteps = 1000 

In [260]: draws = np.random.randint(0, 2, size=(nwalks, nsteps)) # 0 or 1 
In [261]: steps = np.where(draws > 0, 1, -1) 

In [262]: walks = steps.cumsum(l) 


In [263]: walks 
0ut[263] : 


array([[ 

1, 

0, 

1, .. 

8, 

7, 

8], 

[ 

1, 

0, 

- 1 , .. 

34, 

33, 

32], 

[ 

1, 

0, 

- 1 , .. 

4, 

5, 

4], 

[ 

• f 

1, 

2 , 

1, .. 

., 24, 

25, 

26], 

[ 

1, 

2, 

3, .. 

., 14, 

13, 

14], 

[ 

-1, 

-2, 

-3, .. 

., -24, 

-23, 

-22]]) 


Now, we can compute the maximum and minimum values obtained over all of the 
walks: 
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In [264]: walks.max() 

0ut[264] : 138 

In [265]: walks.nln() 

Out[265] : -133 

Out of these walks, let’s compute the minimum crossing time to 30 or -30. This is 
slightly tricky because not all 5,000 of them reach 30. We can check this using the any 
method: 

In [266]: hits30 = (np.abs(walks) >= 30).any(l) 

In [267]: hits30 

0ut[267]: array( [False, True, False, ..., False, True, False], dtype=bool) 

In [268]: hits30.sum( ) # Nunber that hit 30 or -30 
0ut[268] : 3410 

We can use this boolean array to select out the rows of walks that actually cross the 
absolute 30 level and call argmax across axis 1 to get the crossing times: 

In [269]: crossing_tlnes = (np.abs(walks[hits30]) >= 30). argmax(l) 

In [270]: crossing_tlmes.mean() 

Out[270] : 498.88973607038122 

Feel free to experiment with other distributions for the steps other than equal-sized 
coin flips. You need only use a different random number generation function, like 
normal to generate normally distributed steps with some mean and standard 
deviation: 

In [271]: steps = np.random.normal(loc=0, scale=0.25, 

.: size=(nwalks, nsteps)) 

4.8 Conclusion 

While much of the rest of the book will focus on building data wrangling skills with 
pandas, we will continue to work in a similar array-based style. In Appendix A, we 
will dig deeper into NumPy features to help you further develop your array comput¬ 
ing skills. 
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CHAPTER 5 


Getting Started with pandas 


pandas will be a major tool of interest throughout much of the rest of the book. It 
contains data structures and data manipulation tools designed to make data cleaning 
and analysis fast and easy in Python, pandas is often used in tandem with numerical 
computing tools like NumPy and SciPy, analytical libraries like statsmodels and 
scikit-learn, and data visualization libraries like matplotlib. pandas adopts significant 
parts of NumPy’s idiomatic style of array-based computing, especially array-based 
functions and a preference for data processing without for loops. 

While pandas adopts many coding idioms from NumPy, the biggest difference is that 
pandas is designed for working with tabular or heterogeneous data. NumPy, by con¬ 
trast, is best suited for working with homogeneous numerical array data. 

Since becoming an open source project in 2010, pandas has matured into a quite 
large library that’s applicable in a broad set of real-world use cases. The developer 
community has grown to over 800 distinct contributors, who’ve been helping build 
the project as they’ve used it to solve their day-to-day data problems. 

Throughout the rest of the book, I use the following import convention for pandas: 

In [1]: import as 

Thus, whenever you see pd. in code, it’s referring to pandas. You may also find it eas¬ 
ier to import Series and DataFrame into the local namespace since they are so fre¬ 
quently used: 

In [2]: from import Series, DataFrame 
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5.1 Introduction to pandas Data Structures 

To get started with pandas, you will need to get comfortable with its two workhorse 
data structures: Series and DataFrame. While they are not a universal solution for 
every problem, they provide a solid, easy-to-use basis for most applications. 

Series 

A Series is a one-dimensional array-like object containing a sequence of values (of 
similar types to NumPy types) and an associated array of data labels, called its index. 
The simplest Series is formed from only an array of data: 

In [11]: obj = pd.Series([4, 7, -5, 3]) 

In [12]: obj 
0ut[12] : 

0 4 

1 7 

2 -5 

3 3 

dtype: tnt64 

The string representation of a Series displayed interactively shows the index on the 
left and the values on the right. Since we did not specify an index for the data, a 
default one consisting of the integers 0 through N - 1 (where N is the length of the 
data) is created. You can get the array representation and index object of the Series via 
its values and index attributes, respectively: 

In [13]: obj.values 

0ut[13]: array([ 4, 1, -5, 3]) 

In [14]: obj.index # like range(4) 

0ut[14]: RangeIndex(start=0, stop=4, step=l) 

Often it will be desirable to create a Series with an index identifying each data point 
with a label: 

In [15]: obj2 = pd.Serles( [4, 7, -5, 3], index: ['d', 'b', 'a', 'c' ]) 

In [16]: obj2 
0ut[16] : 
d 4 
b 7 
a -5 
c 3 

dtype: int64 
In [17]: obj2.index 

0ut[17]: Index(['d', 'b', 'a 1 , 'c'], dtype=' object' ) 
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Compared with NumPy arrays, you can use labels in the index when selecting single 
values or a set of values: 

In [18]: obj2 [ 'a' ] 

0ut[18] : -5 

In [19]: obj2[' d' ] = 6 

In [20]: obj2[[ 'c' , 'a', 'd'] ] 

Out[20] : 
c 3 
a -5 
d 6 

dtype: tnt64 

Here [' c', ' a', ' d' ] is interpreted as a list of indices, even though it contains 
strings instead of integers. 

Using NumPy functions or NumPy-like operations, such as filtering with a boolean 
array, scalar multiplication, or applying math functions, will preserve the index-value 
link: 

In [21]: obj2[obj2 > 0] 

0ut[21] : 
d 6 
b 7 
c 3 

dtype: tnt64 

In [22]: obj2 * 2 
0ut[22] : 
d 12 
b 14 

a -10 
c 6 
dtype: tnt64 

In [23]: np.exp(obj2) 

0ut[23] : 

d 403.428793 

b 1096.633158 

a 0.006738 

c 20.085537 

dtype: float64 

Another way to think about a Series is as a fixed-length, ordered diet, as it is a map¬ 
ping of index values to data values. It can be used in many contexts where you might 
use a diet: 

In [24]: 'b' in obj2 
0ut[24]: True 
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In [25]: 'e' in obj2 
Out[25]: False 

Should you have data contained in a Python diet, you can create a Series from it by 
passing the diet: 

In [26]: sdata = {'Ohio': 35000, 'Texas’: 71000, ’Oregon': 16000, 'Utah': 5000} 

In [27]: obj3 = pd.Series(sdata) 

In [28]: obj3 
0ut[28] : 

Ohio 35000 

Oregon 16000 

Texas 71000 

Utah 5000 

dtype: int64 

When you are only passing a diet, the index in the resulting Series will have the diet’s 
keys in sorted order. You can override this by passing the diet keys in the order you 
want them to appear in the resulting Series: 

In [29]: states = ['California 1 , 'Ohio', 'Oregon', 'Texas'] 

In [30]: obj4 = pd.Series(sdata, index=states) 

In [31]: obj4 
0ut[31] : 

California NaN 

Ohio 35000.0 

Oregon 16000.0 

Texas 71000.0 

dtype: float64 

Here, three values found in sdata were placed in the appropriate locations, but since 
no value for ' California' was found, it appears as NaN (not a number), which is con¬ 
sidered in pandas to mark missing or NA values. Since 'Utah' was not included in 
states, it is excluded from the resulting object. 

I will use the terms “missing” or “NA” interchangeably to refer to missing data. The 
isnull and notnull functions in pandas should be used to detect missing data: 


In [32]: pd.isnull(obj4) 
0ut[32] : 

California True 


Ohio 

Oregon 

Texas 

dtype: bool 


False 

False 

False 


In [33]: pd.notnull(obj4) 
0ut[33] : 
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California 

False 

Ohio 

True 

Oregon 

True 

Texas 

T rue 

dtype: bool 



Series also has these as instance methods: 


In [34]: obj4.isnull() 
0ut[34] : 

California 

True 

Ohio 

False 

Oregon 

False 

Texas 

dtype: bool 

False 


I discuss working with missing data in more detail in Chapter 7. 

A useful Series feature for many applications is that it automatically aligns by index 
label in arithmetic operations: 

In [35]: obj3 
0ut[35] : 

Ohio 35000 

Oregon 16000 

Texas 71000 

Utah 5000 

dtype: tnt64 

In [36]: obj4 
0ut[36] : 

California NaN 

Ohio 35000.0 

Oregon 16000.0 

Texas 71000.0 

dtype: float64 


In [37]: obj3 + obj4 
0ut[37] : 

California NaN 

Ohio 70000.0 

Oregon 32000.0 

Texas 142000.0 

Utah NaN 

dtype: float64 

Data alignment features will be addressed in more detail later. If you have experience 
with databases, you can think about this as being similar to a join operation. 

Both the Series object itself and its index have a name attribute, which integrates with 
other key areas of pandas functionality: 


5.1 Introduction to pandas Data Structures | 127 



In [38]: obj4.name = 'population' 

In [39]: obj4.index.name = 'state' 

In [40]: obj4 
Out[40] : 
state 

California NaN 

Ohio 35000.0 

Oregon 16000.0 

Texas 71000.0 

Name: population, dtype: float64 

A Series’s index can be altered in-place by assignment: 

In [41]: obj 
0ut[41] : 

0 4 

1 7 

2 -5 

3 3 

dtype: tnt64 

In [42]: obj.index = ['Bob', 'Steve 1 , 'Jeff', 'Ryan'] 

In [43]: obj 
0ut[43] : 

Bob 4 

Steve 7 

Jeff -5 

Ryan 3 

dtype: int64 

DataFrame 

A DataFrame represents a rectangular table of data and contains an ordered collec¬ 
tion of columns, each of which can be a different value type (numeric, string, 
boolean, etc.). The DataFrame has both a row and column index; it can be thought of 
as a diet of Series all sharing the same index. Under the hood, the data is stored as one 
or more two-dimensional blocks rather than a list, diet, or some other collection of 
one-dimensional arrays. The exact details of DataFrame’s internals are outside the 
scope of this book. 



While a DataFrame is physically two-dimensional, you can use it to 
represent higher dimensional data in a tabular format using hier¬ 
archical indexing, a subject we will discuss in Chapter 8 and an 
ingredient in some of the more advanced data-handling features in 
pandas. 
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There are many ways to construct a DataFrame, though one of the most common is 
from a diet of equal-length lists or NumPy arrays: 

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'], 

'year': [2000, 2001, 2002, 2001, 2002, 2003], 

'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]} 
frame = pd.DataFrame(data) 

The resulting DataFrame will have its index assigned automatically as with Series, and 
the columns are placed in sorted order: 


In 

[45] 

: frame 


0ut[45] 




pop 

state 

year 

0 

1.5 

Ohio 

2000 

1 

1.7 

Ohio 

2001 

2 

3.6 

Ohio 

2002 

3 

2.4 

Nevada 

2001 

4 

2.9 

Nevada 

2002 

5 

3.2 

Nevada 

2003 


If you are using the Jupyter notebook, pandas DataFrame objects will be displayed as 
a more browser-friendly HTML table. 

For large DataFrames, the head method selects only the first five rows: 


In 

[46] 

: frame.head() 

0ut[46] 




pop 

state 

year 

0 

1.5 

Ohio 

2000 

1 

1.7 

Ohio 

2001 

2 

3.6 

Ohio 

2002 

3 

2.4 

Nevada 

2001 

4 

2.9 

Nevada 

2002 


If you specify a sequence of columns, the DataFrames columns will be arranged in 
that order: 

In [47]: pd.DataFrame(data, columns=[ 'year' , 'state', 'pop']) 


0ut[47] : 
year 

state 

pop 

0 

2000 

Ohio 

1.5 

i 

2001 

Ohio 

1.7 

2 

2002 

Ohio 

3.6 

3 

2001 

Nevada 

2.4 

4 

2002 

Nevada 

2.9 

5 

2003 

Nevada 

3.2 


If you pass a column that isn’t contained in the diet, it will appear with missing values 
in the result: 

In [48]: frame2 = pd,DataFrame(data, columns=[' year' , 'state', 'pop', 'debt'], 
....: index=[ 'one' , 'two', 'three', 'four', 

....: 'five', 'six']) 
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In [49]: frame2 
0ut[49] : 



year 

state 

pop 

debt 

one 

2000 

Ohio 

1.5 

NaN 

two 

2001 

Ohio 

1.7 

NaN 

three 

2002 

Ohio 

3.6 

NaN 

four 

2001 

Nevada 

2.4 

NaN 

five 

2002 

Nevada 

2.9 

NaN 

six 

2003 

Nevada 

3.2 

NaN 


In [50]: frame2.columns 

Out[50]: Index( [ 'year' , 'state 1 , 'pop 1 , 'debt'], dtype= 'object' ) 

A column in a DataFrame can be retrieved as a Series either by diet-like notation or 
by attribute: 

In [51]: frame2[ 'state' ] 

0ut[51] : 

one Ohio 

two Ohio 

three Ohio 

four Nevada 

five Nevada 

six Nevada 

Name: state, dtype: object 

In [52]: frame2.year 

0ut[52] : 

one 2000 

two 2001 

three 2002 

four 2001 

five 2002 

six 2003 

Name: year, dtype: int64 



Attribute-like access (e.g., frame2.year) and tab completion of col¬ 
umn names in IPython is provided as a convenience. 

frame2[column] works for any column name, but frameZ.column 
only works when the column name is a valid Python variable 
name. 


Note that the returned Series have the same index as the DataFrame, and their name 
attribute has been appropriately set. 

Rows can also be retrieved by position or name with the special loc attribute (much 
more on this later): 
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In [53]: frame2. loc[' three' ] 

Out[53] : 
year 2002 

state Ohio 

pop 3.6 

debt NaN 

Name: three, dtype: object 

Columns can be modified by assignment. For example, the empty 'debt' column 
could be assigned a scalar value or an array of values: 


In [54] 

: frame2[ 1 debt 1 ] 

16.5 

In [55] 

: frame2 



0ut[55] 

year 

state 

pop 

debt 

one 

2000 

Ohio 

1.5 

16.5 

two 

2001 

Ohio 

1.7 

16.5 

three 

2002 

Ohio 

3.6 

16.5 

four 

2001 

Nevada 

2.4 

16.5 

five 

2002 

Nevada 

2.9 

16.5 

six 

2003 

Nevada 

3.2 

16.5 

In [56] 

: frame2[ 'debt' ] 

np.arange(6. ) 

In [57] 

: frame2 



0ut[57] 

year 

state 

pop 

debt 

one 

2000 

Ohio 

1.5 

0.0 

two 

2001 

Ohio 

1.7 

1.0 

three 

2002 

Ohio 

3.6 

2.0 

four 

2001 

Nevada 

2.4 

3.0 

five 

2002 

Nevada 

2.9 

4.0 

six 

2003 

Nevada 

3.2 

5.0 


When you are assigning lists or arrays to a column, the values length must match the 
length of the DataFrame. If you assign a Series, its labels will be realigned exactly to 
the DataFrame’s index, inserting missing values in any holes: 

In [58]: vat = pd.Series([-1.2, -1.5, -1.7], index=[’ two 1 , 'four', 'five']) 

In [59]: frame2[ 1 debt' ] = val 


In [60]: frame2 
Out[60] : 



year 

state 

pop 

debt 

one 

2000 

Ohio 

1.5 

NaN 

two 

2001 

Ohio 

1.7 

-1.2 

three 

2002 

Ohio 

3.6 

NaN 

four 

2001 

Nevada 

2.4 

-1.5 

five 

2002 

Nevada 

2.9 

-1.7 

six 

2003 

Nevada 

3.2 

NaN 
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Assigning a column that doesn’t exist will create a new column. The del keyword will 
delete columns as with a diet. 

As an example of del, I first add a new column of boolean values where the state 
column equals 1 Ohio': 

In [61]: frame2[ 'eastern' ] = frame2.state == 'Ohio' 


In [62]: frame2 
0ut[62] : 



year 

state 

pop 

debt 

eastern 

one 

2000 

Ohio 

1.5 

NaN 

True 

two 

2001 

Ohio 

1.7 

-1.2 

True 

three 

2002 

Ohio 

3.6 

NaN 

True 

four 

2001 

Nevada 

2.4 

-1.5 

False 

five 

2002 

Nevada 

2.9 

-1.7 

False 

six 

2003 

Nevada 

3.2 

NaN 

False 



New columns cannot be created with the frame2.eastern syntax. 


The del method can then be used to remove this column: 

In [63]: del frame2[' eastern' ] 

In [64]: frame2.columns 

0ut[64]: Index( [ 1 year 1 , 'state 1 , 'pop 1 , 'debt'], dtype= 'object' ) 



The column returned from indexing a DataFrame is a view on the 
underlying data, not a copy. Thus, any in-place modifications to the 
Series will be reflected in the DataFrame. The column can be 
explicitly copied with the Series’s copy method. 


Another common form of data is a nested diet of diets: 

In [65]: pop = {'Nevada': {2001: 2.4, 2002: 2.9}, 

_: 'Ohio 1 : {2000: 1.5, 2001: 1.7, 2002: 3.6}} 

If the nested diet is passed to the DataFrame, pandas will interpret the outer diet keys 
as the columns and the inner keys as the row indices: 

In [66]: frame3 = pd.DataFrame(pop) 


In [67]: frame3 
0ut[67] : 

Nevada Ohio 
2000 NaN 1.5 
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2001 2.4 1.7 

2002 2.9 3.6 

You can transpose the DataFrame (swap rows and columns) with similar syntax to a 
NumPy array: 


In [68]: frame3.T 
0ut[68] : 


Nevada 

2000 

NaN 

2001 

2.4 

2002 

2.9 

Ohio 

1.5 

1.7 

3.6 


The keys in the inner diets are combined and sorted to form the index in the result. 
This isn’t true if an explicit index is specified: 

In [69]: pd.DataFrame(pop, index=[2001, 2002, 2003]) 

0ut[69] : 


2001 

Nevada 

2.4 

Ohio 

1.7 

2002 

2.9 

3.6 

2003 

NaN 

NaN 


Diets of Series are treated in much the same way: 

In [70]: pdata = {'Ohio': frame3[' Ohio' ][:-1], 

....: 'Nevada': frane3[ 'Nevada '][:2]} 

In [71]: pd.DataFrame(pdata) 

0ut[71] : 

Nevada Ohio 

2000 NaN 1.5 

2001 2.4 1.7 

For a complete list of things you can pass the DataFrame constructor, see Table 5-1. 

If a DataFrame’s index and columns have their name attributes set, these will also be 
displayed: 

In [72]: frame3.index.name = 'year'; frame3.colunns.name = 'state' 

In [73]: frame3 
0ut[73] : 

state Nevada Ohio 

year 

2000 NaN 1.5 

2001 2.4 1.7 

2002 2.9 3.6 

As with Series, the values attribute returns the data contained in the DataFrame as a 
two-dimensional ndarray: 

In [74]: frame3.values 
0ut[74] : 

array([[ nan, 1.5], 
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[ 2 . 4 , 1.7], 

[ 2.9, 3.6]]) 

If the DataFrame’s columns are different dtypes, the dtype of the values array will be 
chosen to accommodate all of the columns: 


In [75]: frame2.values 
0ut[75] : 


array( [ [2000, 

'Ohio' , 

1.5, nan]. 

[2001, 

'Ohio' , 

1.7, -1.2], 

[2002, 

'Ohio' , 

3.6, nan]. 

[2001, 

'Nevada 

', 2.4, -1.5], 

[2002, 

1 Nevada 

', 2.9, -1.7], 

[2003, 

'Nevada 

1 , 3.2, nan]], dtype=object) 


Table 5-1. Possible data inputs to DataFrame constructor 


1 Type 

Notes I 

2D ndarray 

A matrix of data, passing optional row and column labels 

diet of arrays, lists, or tuples 

Each sequence becomes a column in the DataFrame; all sequences must be the same length 

NumPy structured/record 
array 

Treated as the "diet of arrays" case 

diet of Series 

Each value becomes a column; indexes from each Series are unioned together to form the 
result's row index if no explicit index is passed 

diet of diets 

Each inner diet becomes a column; keys are unioned to form the row index as in the "diet of 
Series" case 

List of diets or Series 

Each item becomes a row in the DataFrame; union of diet keys or Series indexes become the 
DataFrame's column labels 

List of lists or tuples 

Treated as the "2D ndarray" case 

Another DataFrame 

The DataFrame's indexes are used unless different ones are passed 

NumPy MaskedArray 

Like the "2D ndarray" case except masked values become NA/missing in the DataFrame result 


Index Objects 

pandas’s Index objects are responsible for holding the axis labels and other metadata 
(like the axis name or names). Any array or other sequence of labels you use when 
constructing a Series or DataFrame is internally converted to an Index: 

In [76]: obj = pd. Serles(range(3) , index=['a', ' b' , 'c' ]) 

In [77]: index = obj.index 
In [78]: index 

0ut[78]: Index(['a', 'b' , 'c'], dtype= 1 object 1 ) 

In [79]: index[l:] 

0ut[79]: Index(['b', 1 c' ], dtype=' object' ) 

Index objects are immutable and thus can’t be modified by the user: 
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Index [ 1] = ' d' # TypeError 


Immutability makes it safer to share Index objects among data structures: 

In [80]: labels = pd.Index(np.arange(3)) 

In [81]: labels 

0ut[81]: Int64Index( [0, 1, 2], dtype= 'lnt64' ) 

In [82]: obj2 = pd.Serles( [1 . 5, -2.5, 0], lndex=labels) 

In [83]: obj2 
0ut[83] : 

0 1.5 

1 -2.5 

2 0.0 
dtype: float64 

In [84]: obj2.Index is labels 
0ut[84]: True 



Some users will not often take advantage of the capabilities pro¬ 
vided by indexes, but because some operations will yield results 
containing indexed data, it’s important to understand how they 
work. 


In addition to being array-like, an Index also behaves like a fixed-size set: 

In [85]: frame3 
0ut[85] : 

state Nevada Ohio 

year 

2000 NaN 1.5 

2001 2.4 1.7 

2002 2.9 3.6 

In [86]: frame3.columns 

0ut[86]: Index( [ 1 Nevada' , 'Ohio'], dtype=' object' , name=' state 1 ) 

In [87]: 'Ohio' in frame3.columns 
0ut[87]: True 

In [88]: 2003 in frame3.index 
0ut[88]: False 

Unlike Python sets, a pandas Index can contain duplicate labels: 

In [89]: dup_labels = pd.Index( [ 1 foo' , 'foo', 'bar', 'bar']) 

In [90]: dup_labels 

Out[90]: Index( [ 'foo' , 'foo', 'bar', 'bar'], dtype= 'object' ) 
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Selections with duplicate labels will select all occurrences of that label. 

Each Index has a number of methods and properties for set logic, which answer other 
common questions about the data it contains. Some useful ones are summarized in 
Table 5-2. 


Table 5-2. Some Index methods and properties 


1 Method 

Description I 

append 

Concatenate with additional Index objects, producing a new Index 

difference 

Compute set difference as an Index 

intersection 

Compute set intersection 

union 

Compute set union 

isin 

Compute boolean array indicating whether each value is contained in the passed collection 

delete 

Compute new Index with element at index i deleted 

drop 

Compute new Index by deleting passed values 

insert 

Compute new Index by inserting element at index i 

is_monotonic 

Returns True if each element is greater than or equal to the previous element 

is_unique 

Returns T rue if the Index has no duplicate values 

unique 

Compute the array of unique values in the Index 


5.2 Essential Functionality 

This section will walk you through the fundamental mechanics of interacting with the 
data contained in a Series or DataFrame. In the chapters to come, we will delve more 
deeply into data analysis and manipulation topics using pandas. This book is not 
intended to serve as exhaustive documentation for the pandas library; instead, we’ll 
focus on the most important features, leaving the less common (i.e., more esoteric) 
things for you to explore on your own. 

Reindexing 

An important method on pandas objects is reindex, which means to create a new 
object with the data conformed to a new index. Consider an example: 

In [91]: obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', ' b' , 'a', 'c']) 

In [92]: obj 
0ut[92] : 
d 4.5 
b 7.2 
a -5.3 
c 3.6 
dtype: float64 
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Calling reindex on this Series rearranges the data according to the new index, intro¬ 
ducing missing values if any index values were not already present: 

In [93]: obj2 = obj.relndex( [ 'a' , 'b', 'c', 'd', 'e' ]) 

In [94]: obj2 
0ut[94] : 
a -5.3 
b 7.2 
c 3.6 
d 4.5 
e NaN 
dtype: float64 

For ordered data like time series, it may be desirable to do some interpolation or fill¬ 
ing of values when reindexing. The method option allows us to do this, using a 
method such as f f ill, which forward-fills the values: 

In [95]: obj3 = pd.Series( [' blue' , 'purple', 'yellow'], lndex=[0, 2, 4]) 

In [96]: obj3 
0ut[96] : 

0 blue 

2 purple 

4 yellow 

dtype: object 

In [97]: obj3. reindex(range(6), method=' ffill' ) 

0ut[97] : 

0 blue 

1 blue 

2 purple 

3 purple 

4 yellow 

5 yellow 
dtype: object 

With DataFrame, reindex can alter either the (row) index, columns, or both. When 
passed only a sequence, it reindexes the rows in the result: 

In [98]: frame = pd.DataFrame(np. arange(9) . reshape( (3, 3)), 

....: index=['a', 'c', 'd' ], 

....: columns=[ 'Ohio' , 'Texas', 'California']) 


In [99]: frame 
0ut[99] : 

Ohio Texas California 
a 0 1 2 

c 3 4 5 

d 6 7 8 

In [100]: frame2 = frame.reindex( [ 'a' , ' b' , 'c', ' d' ]) 
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In [101]: 

: frame2 


Out[101] : 
Ohio 

Texas 

California 

a 0.0 

1.0 

2.0 

b NaN 

NaN 

NaN 

c 3.0 

4.0 

5.0 

d 6.0 

7.0 

8.0 

: columns can be reindexed with the columns keyword: 

In [102]: 

: states 

= ['Texas', 'Utah', 'California'] 

In [103]: 

: frame. 

reindex(columns=states) 

Out[103] : 
Texas 

Utah 

California 

a 1 

NaN 

2 

c 4 

NaN 

5 

d 7 

NaN 

8 


See Table 5-3 for more about the arguments to reindex. 

As we’ll explore in more detail, you can reindex more succinctly by label-indexing 
with loc, and many users prefer to use it exclusively: 

In [104]: frame.loc[ [ 1 a' , ' b' , 'c' , ’d’], states] 


Out[104] : 
Texas 

Utah 

California 

a 1.0 

NaN 

2.0 

b NaN 

NaN 

NaN 

c 4.0 

NaN 

5.0 

d 7.0 

NaN 

8.0 


Table 5-3. reindex function arguments 


Argument Description 


index 

method 

flll_vatue 

limit 

tolerance 

level 

copy 


New sequence to use as index. Can be index instance or any other sequence-like Python data structure. An 
Index will be used exactly as is without any copying. 

Interpolation (fill) method; ' f f ill' fills forward, while ' bf ill' fills backward. 

Substitute value to use when introducing missing data by reindexing. 

When forward- or backfilling, maximum size gap (in number of elements) to fill. 

When forward- or backfilling, maximum size gap (in absolute numeric distance) to fill for inexact matches. 
Match simple Index on level of Multiindex; otherwise select subset of. 

If T rue, always copy underlying data even if new index is equivalent to old index; if False, do not copy 
the data when the indexes are equivalent. 


Dropping Entries from an Axis 

Dropping one or more entries from an axis is easy if you already have an index array 
or list without those entries. As that can require a bit of munging and set logic, the 
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drop method will return a new object with the indicated value or values deleted from 
an axis: 

In [105]: obj = pd.Serles(np.arange(5. ), Index: ['a 1 , ' b' , 'c', 'd' , 'e']) 

In [106] : obj 
Out[106] : 
a 0.0 

b 1.0 

c 2.0 

d 3.0 

e 4.0 

dtype: float64 

In [107]: new_obj = obj,drop( 'c' ) 

In [108]: new_obj 
Out[108] : 
a 0.0 

b 1.0 

d 3.0 

e 4.0 

dtype: float64 

In [109]: obj,drop( [’ d' , 'c' ]) 

Out[109] : 
a 0.0 

b 1.0 

e 4.0 

dtype: float64 

With DataFrame, index values can be deleted from either axis. To illustrate this, we 
first create an example DataFrame: 

In [110]: data = pd.DataFrame(np.arange(16).reshape((4, 4)), 

.: index=['Ohio 1 , 'Colorado', 'Utah', 'New York'], 

.: colunns=['one', 'two', 'three', 'four']) 


In [111]: data 
Out[lll]: 


Ohio 

one 

0 

two 

1 

three 

2 

four 

3 

Colorado 

4 

5 

6 

7 

Utah 

8 

9 

10 

11 

New York 

12 

13 

14 

15 


Calling drop with a sequence of labels will drop values from the row labels (axis 0): 

In [112]: data. drop([ 'Colorado' , 'Ohio']) 

0ut[112] : 



one 

two 

three 

four 

Utah 

8 

9 

10 

11 

New York 

12 

13 

14 

15 
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You can drop values from the columns by passing axis=l or axis=' columns' 

axls=i) 


In [113]: 

data.drop( 1 

two' , 

0ut[113] : 

one 

three 

four 

Ohio 

0 

2 

3 

Colorado 

4 

6 

7 

Utah 

8 

10 

11 

New York 

12 

14 

15 

In [114]: 

data.drop([ 

' two 1 

0ut[114] : 

one 

three 


Ohio 

0 

2 


Colorado 

4 

6 


Utah 

8 

10 


New York 

12 

14 



'four'], axis= 1 columns 1 ) 


Many functions, like drop, which modify the size or shape of a Series or DataFrame, 
can manipulate an object in-place without returning a new object: 

In [115]: obj,drop( 'c 1 , tnplace=True) 


In [116]: obj 
0ut[116]: 
a 0.0 
b 1.0 
d 3.0 
e 4.0 
dtype: float64 

Be careful with the inplace, as it destroys any data that is dropped. 


Indexing, Selection, and Filtering 

Series indexing (obj [... ]) works analogously to NumPy array indexing, except you 
can use the Series’s index values instead of only integers. Here are some examples of 
this: 


In [117]: obj = pd.Serles(np.arange(4. ), index=['a', ' b' , 'c', 'd' ]) 

In [118]: obj 
0ut[118] : 
a 0.0 

b 1.0 

c 2.0 

d 3.0 

dtype: float64 

In [119]: obj [ 'b' ] 

0ut[119] : 1.0 
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In [120]: obj [1] 

Out[120] : 1.0 

In [121]: obj[2:4] 

0ut[121] : 
c 2.0 
d 3.0 
dtype: float64 

In [122]: obj[['b', 'a', ' d'] ] 

0ut[122] : 
b 1.0 
a 0.0 
d 3.0 
dtype: float64 

In [123]: obj[[l, 3]] 

0ut[123] : 
b 1.0 
d 3.0 
dtype: float64 

In [124]: obj[obj < 2] 

0ut[124] : 
a 0.0 
b 1.0 
dtype: float64 

Slicing with labels behaves differently than normal Python slicing in that the end¬ 
point is inclusive: 

In [125] : obj [ 'b' : 'c' ] 

0ut[125] : 
b 1.0 
c 2.0 
dtype: float64 

Setting using these methods modifies the corresponding section of the Series: 

In [126] : obj [' b' : 1 c' ] = 5 

In [127]: obj 
0ut[127] : 
a 0.0 
b 5.0 
c 5.0 
d 3.0 
dtype: float64 

Indexing into a DataFrame is for retrieving one or more columns either with a single 
value or sequence: 
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In [128]: data = pd.DataFrame(np.arange(16).reshape((4, 4)), 

.: index=[ 1 Ohio 1 , 'Colorado', 'Utah', 'New York'], 

.: columns=['one', 'two', 'three', 'four']) 


In [129]: 
0ut[129]: 

data 





one two three 

four 

Ohio 

0 

1 

2 

3 

Colorado 

4 

5 

6 

7 

Utah 

8 

9 

10 

11 

New York 

12 

13 

14 

15 

In [130]: 
Out[130] : 

data[ 1 

two' ] 



Ohio 

1 




Colorado 

5 




Utah 

9 




New York 

13 




Name: two. 

, dtype 

: int64 


In [131]: 

data[[ 

'three' 

J 

one 1 ] ] 

0ut[131]: 

three 

one 



Ohio 

2 

0 



Colorado 

6 

4 



Utah 

10 

8 



New York 

14 

12 




Indexing like this has a few special cases. First, slicing or selecting data with a boolean 
array: 

In [132]: data[:2] 

0ut[132]: 

one two three four 

Ohio 0 1 2 3 

Colorado 45 67 

In [133]: data[data[' three'] > 5] 

0ut[133]: 

one two three four 

Colorado 45 67 

Utah 89 10 11 

New York 12 13 14 15 

The row selection syntax data[ :2] is provided as a convenience. Passing a single ele¬ 
ment or a list to the [ ] operator selects columns. 

Another use case is in indexing with a boolean DataFrame, such as one produced by a 
scalar comparison: 
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In [134]: 

data < 

5 


0ut[134] : 

one 

two 

three four 

Ohio 

True 

T rue 

True True 

Colorado 

True 

False 

False False 

Utah 

False 

False 

False False 

New York 

False 

False 

False False 

In [135]: 

data[data < 1 

5] = 0 

In [136]: 
0ut[136] : 

data 

one two three four 

Ohio 

0 

0 

0 0 

Colorado 

0 

5 

6 7 

Utah 

8 

9 

10 11 

New York 

12 

13 

14 15 


This makes DataFrame syntactically more like a two-dimensional NumPy array in 
this particular case. 

Selection with loc and iloc 

For DataFrame label-indexing on the rows, I introduce the special indexing operators 
loc and iloc. They enable you to select a subset of the rows and columns from a 
DataFrame with NumPy-like notation using either axis labels (loc) or integers 
(iloc). 

As a preliminary example, let’s select a single row and multiple columns by label: 

In [137]: data.loc[ 'Colorado' , ['two', 'three']] 

0ut[137]: 
two 5 

three 6 

Name: Colorado, dtype: int64 

We’ll then perform some similar selections with integers using iloc: 

In [138]: data.lloc[2, [3, 0, 1]] 

0ut[138]: 
four 11 

one 8 

two 9 

Name: Utah, dtype: int64 

In [139]: data.lloc[2] 

0ut[139]: 


one 8 

two 9 

three 10 

four 11 


Name: Utah, dtype: int64 
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In [140]: data.iloc[[l, 2], [3, 0, 1]] 

Out [140] : 

four one two 
Colorado 705 
Utah 11 8 9 

Both indexing functions work with slices in addition to single labels or lists of labels: 

In [141]: data.loc[ Utah' , 'two'] 

0ut[141] : 

Ohio 0 

Colorado 5 
Utah 9 

Name: two, dtype: int64 

In [142]: data.iloc[ :, :3] [data. three > 5] 

0ut[142] : 

one two three 

Colorado 05 6 

Utah 8 9 10 

New York 12 13 14 

So there are many ways to select and rearrange the data contained in a pandas object. 
For DataFrame, Table 5-4 provides a short summary of many of them. As you’ll see 
later, there are a number of additional options for working with hierarchical indexes. 



When originally designing pandas, I felt that having to type 
frame[:, col] to select a column was too verbose (and error- 
prone), since column selection is one of the most common opera¬ 
tions. I made the design trade-off to push all of the fancy indexing 
behavior (both labels and integers) into the ix operator. In practice, 
this led to many edge cases in data with integer axis labels, so the 
pandas team decided to create the loc and iloc operators to deal 
with strictly label-based and integer-based indexing, respectively. 


The ix indexing operator still exists, but it is deprecated. I do not 
recommend using it. 


Table 5-4. Indexing options with DataFrame 


1 Type 

Notes I 

df[val] 

Select single column or sequence of columns from the DataFrame; special case 
conveniences: boolean array (filter rows), slice (slice rows), or boolean DataFrame 
(set values based on some criterion) 

df,loc[val] 

Selects single row or subset of rows from the DataFrame by label 

df.loc[:, val] 

Selects single column or subset of columns by label 

df.loc[vall, val2] 

Select both rows and columns by label 

df.iloc[where] 

Selects single row or subset of rows from the DataFrame by integer position 


144 | Chapter 5: Getting Started with pandas 






1 Type 

Notes 


df.itoc[:, where] 

Selects single column or subset of columns by integer position 


df.itoc[where_t, where_j] 

Select both rows and columns by integer position 


df,at[label_L, Iabel_j] 

Select a single scalar value by row and column label 


df.iat[i, j] 

Select a single scalar value by row and column position (integers) 


reindex method 

Select either rows or columns by labels 


get_value, set_value methods 

Select single value by row and column label 



Integer Indexes 

Working with pandas objects indexed by integers is something that often trips up 
new users due to some differences with indexing semantics on built-in Python data 
structures like lists and tuples. For example, you might not expect the following code 
to generate an error: 

ser = pd.Serles(np.arange(3. )) 
ser 

serf - 1] 

In this case, pandas could “fall back” on integer indexing, but it’s difficult to do this in 
general without introducing subtle bugs. Here we have an index containing 0, 1, 2, 
but inferring what the user wants (label-based indexing or position-based) is difficult: 

In [144]: ser 
0ut[144]: 

0 0.0 
1 1.0 
2 2.0 
dtype: float64 

On the other hand, with a non-integer index, there is no potential for ambiguity: 

In [145]: ser2 = pd.Series(np.arange(3. ), tndex=['a', 'b', ' c' ]) 

In [146]: ser2[-l] 

0ut[146]: 2.0 

To keep things consistent, if you have an axis index containing integers, data selection 
will always be label-oriented. For more precise handling, use loc (for labels) or iloc 
(for integers): 

In [147]: ser[:l] 

0ut[147]: 

0 0.0 
dtype: float64 

In [148]: ser.toc[:l] 

0ut[148]: 

0 0.0 
1 1.0 
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dtype: float64 


In [149]: ser.tloc[:l] 

0ut[149] : 

0 0.0 
dtype: float64 

Arithmetic and Data Alignment 

An important pandas feature for some applications is the behavior of arithmetic 
between objects with different indexes. When you are adding together objects, if any 
index pairs are not the same, the respective index in the result will be the union of the 
index pairs. For users with database experience, this is similar to an automatic outer 
join on the index labels. Let’s look at an example: 

In [150]: si = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd 1 , 'e' ]) 

In [151]: s2 = pd.Series( [ -2. 1, 3.6, -1.5, 4, 3.1], 

.: index=[ 1 a', 'c', 'e', 'f', 'g']) 


In [152]: si 
0ut[152] : 
a 7.3 
c -2.5 
d 3.4 
e 1.5 
dtype: float64 

In [153]: s2 
0ut[153] : 
a -2.1 
c 3.6 
e -1.5 
f 4.0 
g 3.1 
dtype: float64 

Adding these together yields: 

In [154]: si + s2 
0ut[154] : 
a 5.2 
c 1.1 
d NaN 
e 0.0 
f NaN 
g NaN 
dtype: float64 

The internal data alignment introduces missing values in the label locations that don’t 
overlap. Missing values will then propagate in further arithmetic computations. 
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In the case of DataFrame, alignment is performed on both the rows and the columns: 

In [155]: dfl = pd.DataFrame(np.arange(9. ). reshape((3, 3)), columns=list( 1 bed' ), 
.: index=['Ohio', 'Texas', 'Colorado']) 

In [156]: df2 = pd.DataFrame(np.arange(12. ). reshape((4, 3)), columns=list( 1 bde' ), 
.: index=['Utah', 'Ohio', 'Texas', 'Oregon']) 


In [157]: dfl 
0ut[157]: 


b 

c 

d 

Ohio 0.0 

1.0 

2.0 

Texas 3.0 

4.0 

5.0 

Colorado 6.0 

7.0 

8.0 

In [158]: df2 



0ut[158]: 



b 

d 

e 

Utah 0.0 

1.0 

2.0 

Ohio 3.0 

4.0 

5.0 

Texas 6.0 

7.0 

8.0 


Oregon 9.0 10.0 11.0 

Adding these together returns a DataFrame whose index and columns are the unions 
of the ones in each DataFrame: 

In [159]: dfl + df2 
0ut[159]: 

be d e 
Colorado NaN NaN NaN NaN 

Ohio 3.0 NaN 6.0 NaN 

Oregon NaN NaN NaN NaN 

Texas 9.0 NaN 12.0 NaN 

Utah NaN NaN NaN NaN 

Since the ' c' and ' e 1 columns are not found in both DataFrame objects, they appear 
as all missing in the result. The same holds for the rows whose labels are not common 
to both objects. 

If you add DataFrame objects with no column or row labels in common, the result 
will contain all nulls: 

In [160]: dfl = pd.DataFrame({ 'A' : [1, 2]}) 

In [161]: df2 = pd.DataFrame({ 'B' : [3, 4]}) 

In [162]: dfl 
0ut[162]: 

A 

0 1 
1 2 

In [163]: df2 
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Out [ 163 ]: 

B 

0 3 

1 4 

In [164]: dfl - dfZ 
Out[ 164] : 

A B 

0 NaN NaN 

1 NaN NaN 

Arithmetic methods with fill values 

In arithmetic operations between differently indexed objects, you might want to fill 
with a special value, like 0, when an axis label is found in one object but not the other: 

In [165]: dfl = pd.DataFrame(np.arange(12. ). reshape((3, 4)), 

.: columns=li.st(' abed')) 

In [166]: df2 = pd.DataFrame(np.arange(20. ). reshape((4, 5)), 

.: cotumns=lfst( 1 abede')) 


In [167]: df2.Ioc[l, 'b' ] = np.nan 

In [168]: dfl 
0ut[168]: 

abed 
0 0.0 1.0 2.0 3.0 

1 4.0 5.0 6.0 7.0 

2 8.0 9.0 10.0 11.0 


In [169]: df2 
0ut[169] : 



a 

b 

c 

d 

e 

0 

0.0 

1.0 

2.0 

3.0 

4.0 

1 

5.0 

NaN 

7.0 

8.0 

9.0 

2 

10.0 

11.0 

12.0 

13.0 

14.0 

3 

15.0 

16.0 

17.0 

18.0 

19.0 


Adding these together results in NA values in the locations that don’t overlap: 

In [170]: dfl + dfZ 
Out[170] : 

a b c d e 


0 

0.0 

2.0 

4.0 

6.0 

NaN 

1 

9.0 

NaN 

13.0 

15.0 

NaN 

2 

18.0 

20.0 

22.0 

24.0 

NaN 

3 

NaN 

NaN 

NaN 

NaN 

NaN 


Using the add method on dfl, I pass df2 and an argument to fill_value: 

In [171]: dfl.add(df2, ftll_value=0) 

0ut[171] : 
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a 

b 

c 

d 

e 

0 

0.0 

2.0 

4.0 

6.0 

4.0 

1 

9.0 

5.0 

13.0 

15.0 

9.0 

2 

18.0 

20.0 

22.0 

24.0 

14.0 

3 

15.0 

16.0 

17.0 

18.0 

19.0 


See Table 5-5 for a listing of Series and DataFrame methods for arithmetic. Each of 
them has a counterpart, starting with the letter r, that has arguments flipped. So these 
two statements are equivalent: 


In [172]: 1 / dfl 
0ut[172]: 

a b 

0 inf 1.000000 

1 0.250000 0.200000 

2 0.125000 0.111111 


c d 
0.500000 0.333333 
0.166667 0.142857 
0.100000 0.090909 


In [173]: dfl.rdiv(l) 
0ut[173] : 

a b 

0 inf 1.000000 

1 0.250000 0.200000 

2 0.125000 0.111111 


c d 
0.500000 0.333333 
0.166667 0.142857 
0.100000 0.090909 


Relatedly, when reindexing a Series or DataFrame, you can also specify a different fill 
value: 


In [174]: dfl.reindex(cotunns=df2.columns, flll_value=0) 


0ut[174] : 
a b 

c 

d 

e 

0 0.0 1.0 

2.0 

3.0 

0 

1 4.0 5.0 

6.0 

7.0 

0 

2 8.0 9.0 

10.0 

11.0 

0 


Table 5-5. Flexible arithmetic methods 


1 Method 

Description 1 

add. 

radd 

Methods for addition (+) 

sub, 

rsub 

Methods for subtraction (-) 

div. 

rdiv 

Methods for division (/) 

floordiv, rfloordiv 

Methods for floor division (//) 

nut. 

rmul 

Methods for multiplication (*) 

pow, 

rpow 

Methods for exponentiation (**) 


Operations between DataFrame and Series 

As with NumPy arrays of different dimensions, arithmetic between DataFrame and 
Series is also defined. First, as a motivating example, consider the difference between 
a two-dimensional array and one of its rows: 
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In [175]: arr = np.arange(12. ). reshape((3, 4)) 


In [176]: arr 
Out[176]: 


array( [ [ 

0., 

1., 

2., 

3.], 

[ 

4., 

5 ■ , 

6., 

7.], 

[ 

8., 

9., 

10., 

11-]]) 


In [177]: arr[0] 

Out[177] : array( [ 0., 1., 2., 3.]) 


In [178]: arr - arr[0] 
Out[178]: 


array( [[ 

0., 

0., 

0., 

0.], 

[ 

4., 

4., 

4., 

4.], 

[ 

8., 

8., 

8., 

8.]]) 


When we subtract arr[0] from arr, the subtraction is performed once for each row. 
This is referred to as broadcasting and is explained in more detail as it relates to gen¬ 
eral NumPy arrays in Appendix A. Operations between a DataFrame and a Series are 
similar: 

In [179]: frame = pd,DataFrame(np. arange(12 . ). reshape((4, 3)), 

.: columns=list('bde'), 

.: tndex=[’ Utah 1 , 'Ohio', 'Texas', 'Oregon']) 


In [180]: series = frame. tloc[0] 


In [181]: frame 
0ut[181]: 



b 

d 

e 

Utah 

0.0 

1.0 

2.0 

Ohio 

3.0 

4.0 

5.0 

Texas 

6.0 

7.0 

8.0 

Oregon 

9.0 

10.0 

11.0 


In [182]: series 
0ut[182]: 
b 0.0 
d 1.0 
e 2.0 

Name: Utah, dtype: float64 

By default, arithmetic between DataFrame and Series matches the index of the Series 
on the DataFrame’s columns, broadcasting down the rows: 

In [183]: frame - series 
0ut[183]: 



b 

d 

e 

Utah 

0.0 

0.0 

0.0 

Ohio 

3.0 

3.0 

3.0 

Texas 

6.0 

6.0 

6.0 

Oregon 

9.0 

9.0 

9.0 
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If an index value is not found in either the DataFrame’s columns or the Series’s index, 
the objects will be reindexed to form the union: 

In [184]: series2 = pd.Serles(range(3), index=['b', 'e', 'f']) 


In [185] 

: frame 

+ series2 

0ut[185] 

b 

d 

e 

f 

Utah 

0.0 

NaN 

3.0 

NaN 

Ohio 

3.0 

NaN 

6.0 

NaN 

Texas 

6.0 

NaN 

9.0 

NaN 

Oregon 

9.0 

NaN 

12.0 

NaN 


If you want to instead broadcast over the columns, matching on the rows, you have to 
use one of the arithmetic methods. For example: 

In [186]: series3 = frame[’d'] 


In [187]: frame 
0ut[187] : 



b 

d 

e 

Utah 

0.0 

1.0 

2.0 

Ohio 

3.0 

4.0 

5.0 

Texas 

6.0 

7.0 

8.0 

Oregon 

9.0 

10.0 

11.0 


In [188]: series3 
0ut[188] : 

Utah 1.0 

Ohio 4.0 

Texas 7.0 

Oregon 10.0 

Name: d, dtype: ftoat64 

In [189]: frame.sub(series3, axis='lndex' ) 
0ut[189] : 



b 

d 

e 

Utah 

-1.0 

0.0 

1.0 

Ohio 

-1.0 

0.0 

1.0 

Texas 

-1.0 

0.0 

1.0 

Oregon 

-1.0 

0.0 

1.0 


The axis number that you pass is the axis to match on. In this case we mean to match 
on the DataFrame’s row index (axis=' index' or axis=0) and broadcast across. 

Function Application and Mapping 

NumPy ufuncs (element-wise array methods) also work with pandas objects: 

In [190]: frame = pd,DataFrame(np.random.randn(4, 3), columns=list( 1 bde' ), 

.: index=[’ Utah 1 , 'Ohio', 'Texas', 'Oregon']) 
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In [191]: frame 
0ut[191] : 

b 

Utah -0.204708 
Ohio -0.555730 
Texas 0.092908 
Oregon 1.246435 


d e 
0.478943 -0.519439 
1.965781 1.393406 
0.281746 0.769023 
1.007189 -1.296221 


In [192]: np.abs(frame) 

0ut[192] : 

b d e 

Utah 0.204708 0.478943 0.519439 

Ohio 0.555730 1.965781 1.393406 

Texas 0.092908 0.281746 0.769023 

Oregon 1.246435 1.007189 1.296221 


Another frequent operation is applying a function on one-dimensional arrays to each 
column or row. DataFrame’s apply method does exactly this: 


In [193]: f = lambda x: x.max() - x.minQ 


In [194]: frame.apply(f) 

0ut[194] : 
b 1.802165 
d 1.684034 
e 2.689627 
dtype: float64 

Here the function f, which computes the difference between the maximum and mini¬ 
mum of a Series, is invoked once on each column in frame. The result is a Series hav¬ 
ing the columns of frame as its index. 

If you pass axis='columns' to apply, the function will be invoked once per row 
instead: 

In [195]: frame.apply(f, axts= 'columns' ) 

0ut[195] : 

Utah 0.998382 

Ohio 2.521511 

Texas 0.676115 

Oregon 2.542656 

dtype: float64 

Many of the most common array statistics (like sum and mean) are DataFrame meth¬ 
ods, so using apply is not necessary. 

The function passed to apply need not return a scalar value; it can also return a Series 
with multiple values: 

In [196] : def f(x) : 

.: return pd.Serles( [x.mln( ), x.maxQ], tndex=[ 1 min 1 , 'max']) 


In [197]: frame.apply(f) 
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Out [197] : 


b d e 

min -0.555730 0.281746 -1.296221 

max 1.246435 1.965781 1.393406 

Element-wise Python functions can be used, too. Suppose you wanted to compute a 
formatted string from each floating-point value in frame. You can do this with apply 
map: 


In [198]: format = lambda x: '%.2f' % x 

In [199]: frame.applymap(format) 
0ut[199] : 



b 

d 

e 

Utah 

-0.20 

0.48 

-0.52 

Ohio 

-0.56 

1.97 

1.39 

Texas 

0.09 

0.28 

0.77 

Oregon 

1.25 

1.01 

-1.30 


The reason for the name apply map is that Series has a map method for applying an 
element-wise function: 

In [200]: frame[ 'e' ]. map(format) 

Out [200] : 

Utah -0.52 

Ohio 1.39 

Texas 0.77 

Oregon -1.30 
Name: e, dtype: object 

Sorting and Ranking 

Sorting a dataset by some criterion is another important built-in operation. To sort 
lexicographically by row or column index, use the sort_lndex method, which returns 
a new, sorted object: 

In [201]: obj = pd.Series(range(4), lndex=['d', 'a', 'b', 'c']) 

In [202]: obj.sort_index( ) 

Out[202] : 
a 1 
b 2 
c 3 
d 0 

dtype: tnt64 

With a DataFrame, you can sort by index on either axis: 

In [203]: frame = pd.DataFrame(np. arange(8).reshape((2, 4)), 

.: index=[' three' , 'one'], 

.: columns=[ 1 d', 'a', 'b 1 , 'c']) 


In [204]: frame.sort_index( ) 
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Out[204]: 
d 

one 4 

three 0 

a b c 

5 6 7 

12 3 

In [205]: 
Out[205] : 

a 

three 1 

one 5 

frame.sort_Lndex(axis=l) 

bed 

2 3 0 

6 7 4 


The data is sorted in ascending order by default, but can be sorted in descending 


order, too: 


In [206]: 
Out[206] : 
d 

three 0 

one 4 

frame.sort_index(axis=l, ascending=False) 

c b a 

3 2 1 

7 6 5 


To sort a Series by its values, use its sort_values method: 


In [207]: 

obj = pd.Series ([4, 7, -3, 2]) 

In [208]: 
Out[208] : 

2 -3 

3 2 

0 4 

obj,sort_values() 


l 7 

dtype: tnt64 

Any missing values are sorted to the end of the Series by default: 


In [209]: 

obj = pd.Serles([4, np.nan, 7, np.nan, -3, 2]) 

In [210]: 
Out[210] : 

4 -3.0 

5 2.0 

0 4.0 

2 7.0 

1 NaN 

obj.sort_values() 


3 NaN 
dtype: float64 

When sorting a DataFrame, you can use the data in one or more columns as the sort 
keys. To do so, pass one or more column names to the by option of sort_values: 


In [211]: 

frame = pd.DataFrame({ 1 b' : [4, 7, -3, 2], 'a': [0, 1, 0, 1]}) 

In [212]: 
0ut[212] : 
a b 

frame 
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0 0 4 

117 

2 0-3 

3 12 

In [213]: frame.sort_values(by= 'b' ) 

0ut[213] : 
a b 

2 0-3 

3 12 

0 0 4 

117 

To sort by multiple columns, pass a list of names: 

In [214]: frame.sort_values(by=[ 1 a' , 'b' ]) 

0ut[214] : 
a b 

2 0-3 

0 0 4 

3 1 2 

117 

Ranking assigns ranks from one through the number of valid data points in an array. 
The rank methods for Series and DataFrame are the place to look; by default rank 
breaks ties by assigning each group the mean rank: 

In [215]: obj = pd.Series( [7, -5, 7, 4, 2, 0, 4]) 

In [216]: obj.rank() 

0ut[216] : 

0 6.5 

1 1.0 

2 6.5 

3 4.5 

4 3.0 

5 2.0 

6 4.5 
dtype: float64 

Ranks can also be assigned according to the order in which they’re observed in the 
data: 

In [217]: obj.rank(method=' first 1 ) 

0ut[217] : 

0 6.0 

1 1.0 

2 7.0 

3 4.0 

4 3.0 

5 2.0 

6 5.0 
dtype: float64 
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Here, instead of using the average rank 6.5 for the entries 0 and 2, they instead have 
been set to 6 and 7 because label 0 precedes label 2 in the data. 

You can rank in descending order, too: 

# Assign tie values the naxinun rank in the group 
In [218]: obj . rank(ascendlng=False, method='piax' ) 

0ut[218] : 


2.0 

7.0 

2.0 

4.0 

5.0 

6.0 

4.0 


dtype: float64 

See Table 5-6 for a list of tie-breaking methods available. 
DataFrame can compute ranks over the rows or the columns: 


In [219]: frame 


pd,DataFrame({ 1 b' : 

1 c' : 


[4.3, 7, -3, 2], 'a' 
[-2, 5, 8, -2.5]}) 


[ 0 , 1 , 0 , 1 ], 


In [220]: frame 
Out[220] : 

a b c 
0 0 4.3 -2.0 

1 1 7.0 5.0 

2 0 -3.0 8.0 

3 1 2.0 -2.5 

In [221]: frame.rank(axis= 'columns' ) 
0ut[221] : 

a b c 
0 2.0 3.0 1.0 

1 1.0 3.0 2.0 

2 2.0 1.0 3.0 

3 2.0 3.0 1.0 

Table 5-6. Tie-breaking methods with rank 


Method Description 


' average 1 Default: assign the average rank to each entry in the equal group 
' min 1 Use the minimum rank for the whole group 

' max 1 Use the maximum rank for the whole group 

'first' Assign ranks in the order the values appear in the data 

'dense' Like method=' min', but ranks always increase by 1 in between groups rather than the number of equal 
elements in a group 
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Axis Indexes with Duplicate Labels 

Up until now all of the examples we’ve looked at have had unique axis labels (index 
values). While many pandas functions (like reindex) require that the labels be 
unique, it’s not mandatory. Let’s consider a small Series with duplicate indices: 

In [222]: obj = pd.Serles(range(5), index=[ 'a 1 , 'a', 'b', 'b' , 'c' ]) 

In [223]: obj 
0ut[223] : 
a 0 
a 1 
b 2 
b 3 
c 4 

dtype: int64 

The index’s is_unique property can tell you whether its labels are unique or not: 

In [224]: obj.index.is_unique 
0ut[224]: False 

Data selection is one of the main things that behaves differently with duplicates. 
Indexing a label with multiple entries returns a Series, while single entries return a 
scalar value: 

In [225]: obj [ 'a' ] 

0ut[225] : 
a 0 
a 1 

dtype: int64 

In [226]: obj [ 'c' ] 

0ut[226] : 4 

This can make your code more complicated, as the output type from indexing can 
vary based on whether a label is repeated or not. 

The same logic extends to indexing rows in a DataFrame: 

In [227]: df = pd.DataFrame(np.random. randn(4, 3), index=[’a', 'a', 'b 1 , 'b'] ) 

In [228]: df 
0ut[228] : 

0 1 
a 0.274992 0.228913 

a 0.886429 -2.001637 
b 1.669025 -0.438570 
b 0.476985 3.248944 

In [229]: df.loc['b'] 

0ut[229] : 

0 12 


2 

1.352917 

-0.371843 

-0.539741 

-1.021228 
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b 1.669025 -0.438570 -0.539741 
b 0.476985 3.248944 -1.021228 

5.3 Summarizing and Computing Descriptive Statistics 

pandas objects are equipped with a set of common mathematical and statistical meth¬ 
ods. Most of these fall into the category of reductions or summary statistics, methods 
that extract a single value (like the sum or mean) from a Series or a Series of values 
from the rows or columns of a DataFrame. Compared with the similar methods 
found on NumPy arrays, they have built-in handling for missing data. Consider a 
small DataFrame: 

In [230]: df = pd.DataFrame( [ [1.4, np.nan], [7.1, -4.5], 

: [np.nan, np.nan], [0.75, -1.3]], 

.: index=['a', 'b', 'c', 'd'], 

.: columns=['one', 'two']) 


In [231]: df 
0ut[231] : 

one two 
a 1.40 NaN 
b 7.10 -4.5 
c NaN NaN 
d 0.75-1.3 

Calling DataFrame’s sum method returns a Series containing column sums: 

In [232]: df.sumQ 
0ut[232] : 
one 9.25 
two -5.80 
dtype: float64 

Passing axis=' columns 1 or axis=l sums across the columns instead: 

In [233]: df .sum(axis='columns' ) 

0ut[233] : 
a 1.40 
b 2.60 

c NaN 
d -0.55 
dtype: float64 

NA values are excluded unless the entire slice (row or column in this case) is NA. 
This can be disabled with the sklpna option: 

In [234]: df.mean(axis= 'columns' , skipna=False) 

0ut[234] : 
a NaN 

b 1.300 

c NaN 
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d -0.275 
dtype: float64 

See Table 5-7 for a list of common options for each reduction method. 
Table 5-7. Options for reduction methods 


Method Description 


axis Axis to reduce over; 0 for DataFrame's rows and 1 for columns 
sklpna Exclude missing values; True by default 

level Reduce grouped by level if the axis is hierarchically indexed (Multiindex) 


Some methods, like idxmin and idxniax, return indirect statistics like the index value 
where the minimum or maximum values are attained: 

In [235]: df.idxmax() 

0ut[235] : 
one b 
two d 
dtype: object 

Other methods are accumulations: 

In [236]: df.cumsumQ 
0ut[236] : 

one two 
a 1.40 NaN 
b 8.50 -4.5 
c NaN NaN 
d 9.25 -5.8 

Another type of method is neither a reduction nor an accumulation, describe is one 
such example, producing multiple summary statistics in one shot: 

In [237]: df.describeQ 
0ut[237] : 

one two 

count 3.000000 2.000000 

mean 3.083333 -2.900000 
std 3.493685 2.262742 

min 0.750000 -4.500000 
25% 1.075000 -3.700000 

50% 1.400000 -2.900000 

75% 4.250000 -2.100000 

max 7.100000 -1.300000 

On non-numeric data, describe produces alternative summary statistics: 

In [238]: obj = pd.Seriesf [ 1 a 1 , 'a', 'b' , 'c' ] * 4) 

In [239]: obj.describe() 

0ut[239] : 
count 16 
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unique 3 

top a 

freq 8 

dtype: object 

See Table 5-8 for a full list of summary statistics and related methods. 


Table 5-8. Descriptive and summary statistics 


1 Method 

Description J 

count 

Number of non-NA values 

describe 

Compute set of summary statistics for Series or each DataFrame column 

min, max 

Compute minimum and maximum values 

argmin, argmax 

Compute index locations (integers) at which minimum or maximum value obtained, respectively 

idxmin, idxmax 

Compute index labels at which minimum or maximum value obtained, respectively 

quantile 

Compute sample quantile ranging from 0 to 1 

sum 

Sum of values 

mean 

Mean of values 

median 

Arithmetic median (50% quantile) of values 

mad 

Mean absolute deviation from mean value 

prod 

Product of all values 

var 

Sample variance of values 

std 

Sample standard deviation of values 

skew 

Sample skewness (third moment) of values 

kurt 

Sample kurtosis (fourth moment) of values 

cumsum 

Cumulative sum of values 

cummin, cummax 

Cumulative minimum or maximum of values, respectively 

cumprod 

Cumulative product of values 

diff 

Compute first arithmetic difference (useful for time series) 

pct_change 

Compute percent changes 


Correlation and Covariance 

Some summary statistics, like correlation and covariance, are computed from pairs of 
arguments. Let’s consider some DataFrames of stock prices and volumes obtained 
from Yahoo! Finance using the add-on pandas-datareader package. If you don’t 
have it installed already, it can be obtained via conda or pip: 

conda install pandas-datareader 

I use the pandas_datareader module to download some data for a few stock tickers: 

import pandas_datareader.dat.-: as web 

all_data = {ticker: web.get_data_yahoo(ticker) 

for ticker in [ 'AAPL' , 'IBM', 'MSFT', 'GOOG']} 
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price = pd.DataFrame({ticker: data['Adj Close'] 

for ticker, data in all_data.items()}) 
volume = pd.DataFrame({ticker: data [ 'Volume' ] 

for ticker, data in all_data.items()}) 



It’s possible by the time you are reading this that Yahoo! Finance no 
longer exists since Yahoo! was acquired by Verizon in 2017. Refer 
to the pandas-datareader documentation online for the latest 
functionality. 


I now compute percent changes of the prices, a time series operation which will be 
explored further in Chapter 11: 

In [242]: returns = price.pct_change() 


In [243]: returns.tail() 

Out [243] : 

AAPL G00G IBM MSFT 


Date 

2016-10-17 

2016-10-18 

2016-10-19 

2016-10-20 

2016-10-21 


-0.000680 

-0.000681 

-0.002979 

-0.000512 

-0.003930 


0.001837 

0.019616 

0.007846 

-0.005652 

0.003011 


0.002072 

-0.026168 

0.003583 

0.001719 

-0.012474 


-0.003483 

0.007690 

-0.002255 

-0.004867 

0.042096 


The corr method of Series computes the correlation of the overlapping, non-NA, 
aligned-by-index values in two Series. Relatedly, cov computes the covariance: 

In [244]: returns[' MSFT' ] .corr(returns[' IBM ']) 

0ut[244]: 0.49976361144151144 


In [245]: returns[ 'MSFT' ] .cov(returns[' IBM' ]) 

0ut[245] : 8.8706554797035462e-05 

Since MSFT is a valid Python attribute, we can also select these columns using more 
concise syntax: 

In [246]: returns.MSFT,corr(returns.IBM) 

0ut[246] : 0.49976361144151144 

DataFrame’s corr and cov methods, on the other hand, return a full correlation or 
covariance matrix as a DataFrame, respectively: 


In [247]: returns. corr() 
0ut[247] : 



AAPL 

GOOG 

IBM 

AAPL 

1.000000 

0.407919 

0.386817 

GOOG 

0.407919 

1.000000 

0.405099 

IBM 

0.386817 

0.405099 

1.000000 

MSFT 

0.389695 

0.465919 

0.499764 


MSFT 

0.389695 

0.465919 

0.499764 

1.000000 
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In [248]: returns. cov() 

Out [248] : 

AAPL GOOG IBM MSFT 

AAPL 0.000277 0.000107 0.000078 0.000095 

GOOG 0.000107 0.000251 0.000078 0.000108 

IBM 0.000078 0.000078 0.000146 0.000089 

MSFT 0.000095 0.000108 0.000089 0.000215 

Using DataFrame’s corrwith method, you can compute pairwise correlations 
between a DataFrame’s columns or rows with another Series or DataFrame. Passing a 
Series returns a Series with the correlation value computed for each column: 

In [249]: returns.corrwith(returns.IBM) 

Out [249] : 

AAPL 0.386817 
GOOG 0.405099 
IBM 1.000000 

MSFT 0.499764 
dtype: float64 

Passing a DataFrame computes the correlations of matching column names. Here I 
compute correlations of percent changes with volume: 

In [250]: returns.corrwith(volume) 

Out[250] : 

AAPL -0.075565 
GOOG -0.007067 
IBM -0.204849 
MSFT -0.092950 
dtype: float64 

Passing axis = 1 columns ' does things row-by-row instead. In all cases, the data points 
are aligned by label before the correlation is computed. 

Unique Values, Value Counts, and Membership 

Another class of related methods extracts information about the values contained in a 
one-dimensional Series. To illustrate these, consider this example: 

In [251]: obj = pd.Series( [' c' , 'a', 'd', 'a', 'a 1 , 'b', 'b', 'c 1 , 'c']) 

The first function is unique, which gives you an array of the unique values in a Series: 

In [252]: uniques = obj.unique() 

In [253]: uniques 

0ut[253]: array(['c', 'a', 'd 1 , ' b' ], dtype=object) 

The unique values are not necessarily returned in sorted order, but could be sorted 
after the fact if needed (uniques. sortQ). Relatedly, value_counts computes a Series 
containing value frequencies: 
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In [254]: obj. value_counts() 

0ut[254] : 
c 3 
a 3 
b 2 
d 1 

dtype: lnt64 

The Series is sorted by value in descending order as a convenience. value_counts is 
also available as a top-level pandas method that can be used with any array or 
sequence: 

In [255]: pd.value_counts(obj.values, sort=False) 

0ut[255] : 
a 3 
b 2 
c 3 
d 1 

dtype: lnt64 

isin performs a vectorized set membership check and can be useful in filtering a 
dataset down to a subset of values in a Series or column in a DataFrame: 

In [256]: obj 
0ut[256] : 

0 c 

1 a 

2 d 

3 a 

4 a 

5 b 

6 b 

7 c 

8 c 

dtype: object 

In [257]: mask = obj.tsin( [' b' , ' c' ]) 

In [258]: mask 
0ut[258] : 

0 True 

1 False 

2 False 

3 False 

4 False 

5 True 

6 True 

7 True 

8 True 
dtype: bool 

In [259]: obj[mask] 

0ut[259] : 
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0 c 

5 b 

6 b 

7 c 

8 c 

dtype: object 

Related to isin is the Index.get_indexer method, which gives you an index array 
from an array of possibly non-distinct values into another array of distinct values: 

In [260]: to_match = pd.Series( [' c' , 'a 1 , 'b' , ' b 1 , 'c' , 'a']) 

In [261]: unique_vals = pd.Series( [ 1 c 1 , ’ b' , 'a']) 

In [262]: pd.Index(unique_vals).get_lndexer(to_rnatch) 

0ut[262] : array( [0, 2 , 1 , 1, 0, 2]) 

See Table 5-9 for a reference on these methods. 

Table 5-9. Unique, value counts, and set membership methods 


Method Description 


isin Compute boolean array indicating whether each Series value is contained in the passed sequence of 

values 

natch Compute integer indices for each value in an array into another array of distinct values; helpful for data 

alignment and join-type operations 

unique Compute array of unique values in a Series, returned in the order observed 

value_counts Return a Series containing unique values as its index and frequencies as its values, ordered count in 
descending order 


In some cases, you may want to compute a histogram on multiple related columns in 
a DataFrame. Here’s an example: 


In [263]: data = pd.DataFrame({ 'Qul 1 : 

.: 'Qu2' : 

.: 'Qu3' : 


[1, 3, 4, 3, 4], 

[2, 3, 1, 2, 3], 

[1, 5, 2, 4, 4]}) 


In [264]: data 
0ut[264] : 

Qul Qu2 Qu3 
0 12 1 

13 3 5 

2 4 12 

3 3 2 4 

4 4 3 4 

Passing pandas. value_counts to this DataFrame’s apply function gives: 
In [265]: result = data.apply(pd.value_counts).fillna(0) 

In [266]: result 
0ut[266] : 
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Qul 

Qu2 

Qu3 

1 

1.0 

1.0 

1.0 

2 

0.0 

2.0 

1.0 

3 

2.0 

2.0 

0.0 

4 

2.0 

0.0 

2.0 

5 

0.0 

0.0 

1.0 


Here, the row labels in the result are the distinct values occurring in all of the col¬ 
umns. The values are the respective counts of these values in each column. 

5.4 Conclusion 

In the next chapter, we’ll discuss tools for reading (or loading) and writing datasets 
with pandas. After that, we’ll dig deeper into data cleaning, wrangling, analysis, and 
visualization tools using pandas. 
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CHAPTER 6 


Data Loading, Storage, and File Formats 


Accessing data is a necessary first step for using most of the tools in this book. I’m 
going to be focused on data input and output using pandas, though there are numer¬ 
ous tools in other libraries to help with reading and writing data in various formats. 

Input and output typically falls into a few main categories: reading text files and other 
more efficient on-disk formats, loading data from databases, and interacting with net¬ 
work sources like web APIs. 


6.1 Reading and Writing Data in Text Format 

pandas features a number of functions for reading tabular data as a DataFrame 
object. Table 6-1 summarizes some of them, though read_csv and read_table are 
likely the ones you’ll use the most. 


Table 6-1. Parsing functions in pandas 


Function Description 


read_csv 

read_table 

read_fwf 

read_cltpboard 

read_excel 

read_hdf 

read_html 

read_json 

read_msgpack 

read_pickle 


Load delimited data from a file, URL, or file-like object; use comma as default delimiter 
Load delimited data from a file, URL, or file-like object; use tab (' \t') as default delimiter 
Read data in fixed-width column format (i.e., no delimiters) 

Version of read_table that reads data from the clipboard; useful for converting tables from web 
pages 

Read tabular data from an Excel XLS or XLSX file 

Read HDF5 files written by pandas 

Read all tables found in the given HTML document 

Read data from a JSON (JavaScript Object Notation) string representation 

Read pandas data encoded using the MessagePack binary format 

Read an arbitrary object stored in Python pickle format 
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Function 


Description 


read_sas 
read_sql 
read_stata 
read feather 


Read a SAS dataset stored in one of the SAS system's custom storage formats 
Read the results of a SQL query (using SQLAIchemy) as a pandas DataFrame 
Read a dataset from Stata file format 
Read the Feather binary file format 


I’ll give an overview of the mechanics of these functions, which are meant to convert 
text data into a DataFrame. The optional arguments for these functions may fall into 
a few categories: 

Indexing 

Can treat one or more columns as the returned DataFrame, and whether to get 
column names from the file, the user, or not at all. 

Type inference and data conversion 

This includes the user-defined value conversions and custom list of missing value 
markers. 

Datetime parsing 

Includes combining capability, including combining date and time information 
spread over multiple columns into a single column in the result. 

Iterating 

Support for iterating over chunks of very large files. 

Unclean data issues 

Skipping rows or a footer, comments, or other minor things like numeric data 
with thousands separated by commas. 

Because of how messy data in the real world can be, some of the data loading func¬ 
tions (especially read_csv) have grown very complex in their options over time. It’s 
normal to feel overwhelmed by the number of different parameters (read_csv has 
over 50 as of this writing). The online pandas documentation has many examples 
about how each of them works, so if you’re struggling to read a particular file, there 
might be a similar enough example to help you find the right parameters. 

Some of these functions, like pandas. read_csv, perform type inference, because the 
column data types are not part of the data format. That means you don’t necessarily 
have to specify which columns are numeric, integer, boolean, or string. Other data 
formats, like HDF5, Feather, and msgpack, have the data types stored in the format. 

Handling dates and other custom types can require extra effort. Let’s start with a 
small comma-separated (CSV) text file: 

In [8]: !cat examples/exl.csv 
a,b,c,d, message 
1,2,3,4, hello 
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5, 6,7,8, world 
9, 10,11,12, foo 



Here I used the Unix cat shell command to print the raw contents 
of the file to the screen. If you’re on Windows, you can use type 
instead of cat to achieve the same effect. 


Since this is comma-delimited, we can use read_csv to read it into a DataFrame: 

In [9]: df = pd.read_csv( 'examples/exl.csv' ) 

In [10]: df 
Out[10] : 

abed message 
01234 hello 
15678 world 
2 9 10 11 12 foo 

We could also have used read_table and specified the delimiter: 

In [11]: pd.read_table( 'examples/exl.csv' , sep=',') 

Out[ll] : 

abed message 
01234 hello 
15678 world 
2 9 10 11 12 foo 

A file will not always have a header row. Consider this file: 

In [12]: !cat examples/ex2.csv 
1,2,3,4, hello 
5, 6, 7,8, world 
9,10,11,12, foo 

To read this file, you have a couple of options. You can allow pandas to assign default 
column names, or you can specify names yourself: 

In [13]: pd.read_csv( 'examples/ex2.csv 1 , header=None) 

0ut[13] : 

0 12 3 4 

01234 hello 
15678 world 
2 9 10 11 12 foo 

In [14]: pd.read_csv( 1 examples/ex2.csv 1 , names: ['a', ’ b' , 'c', 'd' , 'message']) 
0ut[14] : 

abed message 
01234 hello 
15678 world 
2 9 10 11 12 foo 
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Suppose you wanted the message column to be the index of the returned DataFrame. 
You can either indicate you want the column at index 4 or named ' message' using 
the index_col argument: 

In [15]: names = ['a', 'b' , 'c', 'd 1 , ’message 1 ] 

In [16]: pd.read_csv( 'examples/ex2.csv' , names=names, index_col= ’message' ) 

0ut[16] : 

abed 

message 

hello 12 3 4 

world 5 6 7 8 

foo 9 10 11 12 

In the event that you want to form a hierarchical index from multiple columns, pass a 
list of column numbers or names: 

In [17]: !cat examples/csv_mlndex.csv 

key 1, key 2, valuel,value2 

one, a, 1,2 

one, b, 3, 4 

one, c, 5, 6 

one, d, 7, 8 

two,a, 9, 10 

two,b, 11, 12 

two,c, 13, 14 

two,d, 15, 16 

In [18]: parsed = pd.read_csv( ’examples/csv_mindex.csv' , 

lndex_col=[ 'keyl' , 'key2' ]) 


In [19]: parsed 


0ut[19] : 


valuel 

value2 

keyl key2 

one a 

1 

2 

b 

3 

4 

c 

5 

6 

d 

7 

8 

two a 

9 

10 

b 

11 

12 

c 

13 

14 

d 

15 

16 


In some cases, a table might not have a fixed delimiter, using whitespace or some 
other pattern to separate fields. Consider a text file that looks like this: 

In [20]: ltst(open( 'examptes/ex3.txt' )) 

Out[20] : 

[' A B C\n' , 

1 aaa -0.264438 -1.026059 -0.619500\n’ , 

1 bbb 0.927272 0.302904 -0.032399\n’ , 
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1 ccc -0.264273 -0.386314 -0.217601\n' , 

1 ddd -0.871858 -0.348382 1.100491\n'] 

While you could do some munging by hand, the fields here are separated by a vari¬ 
able amount of whitespace. In these cases, you can pass a regular expression as a 
delimiter for read_table. This can be expressed by the regular expression \s+, so we 
have then: 

In [21]: result = pd.read_table( 'examples/ex3.txt' , sep='\s+') 

In [22]: result 
0ut[22] : 

ABC 
aaa -0.264438 -1.026059 -0.619500 
bbb 0.927272 0.302904 -0.032399 

ccc -0.264273 -0.386314 -0.217601 
ddd -0.871858 -0.348382 1.100491 

Because there was one fewer column name than the number of data rows, 
read_table infers that the first column should be the DataFrame’s index in this spe¬ 
cial case. 

The parser functions have many additional arguments to help you handle the wide 
variety of exception file formats that occur (see a partial listing in Table 6-2). For 
example, you can skip the first, third, and fourth rows of a file with skiprows: 

In [23]: !cat examples/ex4.csv 

# hey! 

a,b,c,d, message 

# just wanted to make things more difficult for you 

# who reads CSV files with computers, anyway? 

1,2,3, 4, hello 

5, 6, 7,8, world 
9,10,ll,12,foo 

In [24]: pd.read_csv( 'examples/ex4.csv' , sklprows=[0, 2, 3]) 

0ut[24] : 

abed message 
01234 hello 
15678 world 
2 9 10 11 12 foo 

Handling missing values is an important and frequently nuanced part of the file pars¬ 
ing process. Missing data is usually either not present (empty string) or marked by 
some sentinel value. By default, pandas uses a set of commonly occurring sentinels, 
such as NA and NULL: 

In [25]: !cat examples/ex5.csv 
something,a,b,c,d,message 
one, 1,2, 3, 4, NA 
two, 5,6, ,8, world 
three, 9, 10, 11,12, foo 

In [26]: result = pd.read_csv( 'examples/ex5.csv' ) 
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In [27]: result 
Out[27] : 



something 

a b 

c d message 


0 

one 

1 2 

3.0 4 NaN 


1 

two 

5 6 

NaN 8 world 


2 

three 

9 10 

11.0 12 foo 


In 

[28]: pd.isnull(result) 


Out[28] : 





something 

a 

bed 

message 

0 

False 

False 

False False False 

I rue 

1 

False 

False 

False True False 

False 

2 

False 

False 

False False False 

False 


The na_values option can take either a list or set of strings to consider missing 
values: 

In [29]: result = pd.read_csv( 'examples/ex5.csv' , na_values [' NULL' ]) 


In [30]: result 
Out[30] : 



something 

a 

b 

c 

d 

message 

0 

one 

1 

2 

3.0 

4 

NaN 

1 

two 

5 

6 

NaN 

8 

world 

2 

three 

9 

10 

11.0 

12 

foo 


Different NA sentinels can be specified for each column in a diet: 

In [31]: sentinels = {'message 1 : ['foo', 'NA'], 'something': ['two']} 

In [32]: pd.read_csv( 'examples/ex5.csv' , na_values=sentinels) 

0ut[32] : 



something 

a 

b 

c 

d 

message 

0 

one 

1 

2 

3.0 

4 

NaN 

1 

NaN 

5 

6 

NaN 

8 

world 

2 

three 

9 

10 

11.0 

12 

NaN 


Table 6-2 lists some frequently used options in pandas.read_csv and pan 
das. read_table. 

Table 6-2. Some read_csv/read_table function arguments 


Argument Description 


path String indicating filesystem location, URL, or file-like object 

sep or delimiter Character sequence or regular expression to use to split fields in each row 
header Row number to use as column names; defaults to 0 (first row), but should be None if there is no 

header row 

index_col Column numbers or names to use as the row index in the result; can be a single name/number or a 

list of them for a hierarchical index 

names List of column names for result, combine with header=None 
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1 Argument 

Description | 

sklprows 

Number of rows at beginning of file to ignore or list of row numbers (starting from 0) to skip. 

na_values 

Sequence of values to replace with NA. 

comment 

Character(s) to split comments off the end of lines. 

parse_dates 

Attempt to parse data to datetime; False by default. If True, will attempt to parse all columns. 
Otherwise can specify a list of column numbers or name to parse. If element of list is tuple or list, will 
combine multiple columns together and parse to date (e.g., if date/time split across two columns). 

keep_date_col 

If joining columns to parse date, keep the joined columns; False by default. 

converters 

Diet containing column number of name mapping to functions (e.g., {' foo 1 : f } would apply the 
function f to all values in the 1 foo' column). 

dayfirst 

When parsing potentially ambiguous dates, treat as international format (e.g., 7/6/2012 -> June 7, 
2012); False by default. 

date_parser 

Function to use to parse dates. 

nrows 

Number of rows to read from beginning of file. 

iterator 

Return a TextParser object for reading file piecemeal. 

chunksize 

For iteration, size of file chunks. 

skip_footer 

Number of lines to ignore at end of file. 

verbose 

Print various parser output information, like the number of missing values placed in non-numeric 
columns. 

encoding 

Text encoding for Unicode (e.g., ' utf-8 1 for UTF-8 encoded text). 

squeeze 

If the parsed data only contains one column, return a Series. 

thousands 

Separator for thousands (e.g., ' , ' or 1 .'). 


Reading Text Files in Pieces 

When processing very large files or figuring out the right set of arguments to cor¬ 
rectly process a large file, you may only want to read in a small piece of a file or iterate 
through smaller chunks of the file. 

Before we look at a large file, we make the pandas display settings more compact: 

In [33]: pd.options.display.max_rows = 10 

Now we have: 


In [34]: result = pd.read_csv( 'examples/ex6.csv' ) 


In [35]: result 


0ut[35] : 

one two three 

0 0.467976 -0.038649 -0.295344 

1 -0.358893 1.404453 0.704965 

2 -0.501840 0.659254 -0.421691 

3 0.204886 1.074134 1.388361 

4 0.354628 -0.133116 0.283763 


four key 
-1.824726 L 
-0.200638 B 
-0.057688 G 
-0.982404 R 
-0.837063 Q 


9995 2.311896 -0.417070 -1.409599 -0.515821 L 
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9996 -0.479893 -0.650419 0.745152 -0.646038 E 

9997 0.523331 0.787112 0.486066 1.093156 K 

9998 -0.362559 0.598894 -1.843201 0.887292 G 

9999 -0.096376 -1.012999 -0.657431 -0.573315 0 

[10000 rows x 5 columns] 

If you want to only read a small number of rows (avoiding reading the entire file), 
specify that with nrows: 

In [36]: pd.read_csv( 'examples/ex6.csv' , nrows=5) 

0ut[36] : 

one two three four key 

0 0.467976 -0.038649 -0.295344 -1.824726 L 

1 -0.358893 1.404453 0.704965 -0.200638 B 

2 -0.501840 0.659254 -0.421691 -0.057688 G 

3 0.204886 1.074134 1.388361 -0.982404 R 

4 0.354628 -0.133116 0.283763 -0.837063 Q 

To read a file in pieces, specify a chunksize as a number of rows: 

In [37]: chunker = pd.read_csv( 'examples/ex6.csv' , chunksize=1000) 

In [38]: chunker 

0ut[38]: <pandas.to.parsers.TextFileReader at 0x7f6ble2672e8> 

The TextParser object returned by read_csv allows you to iterate over the parts of 
the file according to the chunksize. For example, we can iterate over ex6.csv, aggre¬ 
gating the value counts in the ' key ' column like so: 

chunker = pd.read_csv( 'examples/ex6.csv' , chunkstze=1000) 

tot = pd.Series( []) 
for piece in chunker: 

tot = tot. add(ptece[' key' ]. value_counts() , fill_value=0) 

tot = tot.sort_values(ascendlng=False) 

We have then: 

In [40]: tot[ : 10] 

Out[40] : 

E 368.0 
X 364.0 
L 346.0 
0 343.0 

Q 340.0 
M 338.0 
3 337.0 

F 335.0 
K 334.0 
H 330.0 
dtype: float64 
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TextParser is also equipped with a get_chunk method that enables you to read 
pieces of an arbitrary size. 

Writing Data to Text Format 

Data can also be exported to a delimited format. Let’s consider one of the CSV files 
read before: 

In [41]: data = pd.read_csv( 'examples/ex5.csv' ) 


In [42]: data 
0ut[42] : 



something 

a 

b 

c 

d 

message 

0 

one 

1 

2 

3.0 

4 

NaN 

1 

two 

5 

6 

NaN 

8 

world 

2 

three 

9 

10 

11.0 

12 

foo 


Using DataFrame’s to_csv method, we can write the data out to a comma-separated 
file: 


In [43]: data.to_csv( 'examples/out.csv' ) 


In [44]: !cat examples/out.csv 
.something,a,b,c,d,message 
0, one, 1,2, 3. 0,4, 

1, two, 5,6,,8, world 

2, three, 9, 10,11.0,12, foo 


Other delimiters can be used, of course (writing to sys.stdout so it prints the text 
result to the console): 


In [45]: Import 


In [46]: data.to_csv(sys.stdout, sep='|') 
|something|a|b|c|d|message 
0 1 one 1 11 2 1 3.014 1 
1 1 two| 516 11 8| world 
2 | three| 9 | 10|11.0 | 12| foo 


Missing values appear as empty strings in the output. You might want to denote them 
by some other sentinel value: 


In [47]: data.to_csv(sys.stdout, na_rep=' NULL 1 ) 
, something, a,b,c,d, message 
0, one, 1,2, 3. 0,4, NULL 

1, two, 5,6, NULL, 8, world 

2, three, 9, 10, 11. 0,12, foo 


With no other options specified, both the row and column labels are written. Both of 
these can be disabled: 
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In [48]: data.to_csv(sys.stdout, index=False, header=False) 
one, 1,2,3. 0,4, 
two, 5,6, ,8, world 
three, 9, 10, 11.0, 12, foo 

You can also write only a subset of the columns, and in an order of your choosing: 

In [49]: data.to_csv(sys.stdout, index=False, columns: [' a 1 , ' b' , 'c' ]) 

a,b,c 

1,2, 3.0 

5,6, 

9,10,11.0 

Series also has a to_csv method: 

In [50]: dates = pd.date_range( '1/1/2000' , perlods=7) 

In [51]: ts = pd.Series(np.arange(7) , !ndex=dates) 

In [52]: ts.to_csv( 'examples/tseries.csv' ) 

In [53]: !cat examples/tseries.csv 
2000 - 01 - 01,0 
2000 - 01 - 02,1 
2000-01-03,2 
2000-01-04,3 
2000-01-05,4 
2000-01-06,5 
2000-01-07,6 

Working with Delimited Formats 

It’s possible to load most forms of tabular data from disk using functions like pan 
das. read_table. In some cases, however, some manual processing may be necessary. 
It’s not uncommon to receive a file with one or more malformed lines that trip up 
read_table. To illustrate the basic tools, consider a small CSV file: 

In [54]: !cat examples/ex7.csv 
"a" , "b" , "c" 

"1" , "2" , "3" 

"1" , "2" , "3" 

For any file with a single-character delimiter, you can use Python’s built-in csv mod¬ 
ule. To use it, pass any open file or file-like object to csv. reader: 

import csv 

f = open( 'examples/ex7.csv' ) 
reader = csv.reader(f) 

Iterating through the reader like a file yields tuples of values with any quote charac¬ 
ters removed: 
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In [56]: for line in reader: 
....: print(line) 


[ 'a' , 

'b'. 

■c'] 

[’!', 

'2', 

'3'] 

['1'. 

'2', 

'3'] 


From there, it’s up to you to do the wrangling necessary to put the data in the form 
that you need it. Let’s take this step by step. First, we read the file into a list of lines: 

In [57]: with open( 'examples/ex7.csv' ) as f: 

....: lines = list(csv. reader(f )) 

Then, we split the lines into the header line and the data lines: 

In [58]: header, values = lines[0], lines[l:] 

Then we can create a dictionary of data columns using a dictionary comprehension 
and the expression zip(*values), which transposes rows to columns: 

In [59]: data_dict = {h: v for h, v in zip(header, zip(*values) )} 

In [60]: data_dict 

Out[60] : {'a': ('1', '1'), ’ b’ : ('2', '2'), V: ('3', '3')} 

CSV files come in many different flavors. To define a new format with a different 
delimiter, string quoting convention, or line terminator, we define a simple subclass 
of csv. Dialect: 

class ny_dialect(csv. Dialect) : 
lineterminator = ' \n' 
delimiter = ' 
quotechar = '" 1 
quoting = csv.QUOTE_MINIMAL 

reader = csv.reader(f , dialect=my_dialect) 

We can also give individual CSV dialect parameters as keywords to csv.reader 
without having to define a subclass: 

reader = csv. reader(f , delimiter' | 1 ) 

The possible options (attributes of csv.Dialect) and what they do can be found in 
Table 6-3. 

Table 6-3. CSV dialect options 


Argument Description 


delimiter One-character string to separate fields; defaults to 1 ,'. 

lineterminator Line terminator for writing; defaults to ' \r\n 1 . Reader ignores this and recognizes cross-platform 
line terminators. 

quotechar Quote character for fields with special characters (like a delimiter); default is 
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Argument Description 


quoting Quoting convention. Options include csv. QUOTE_ALL (quote all fields), csv. QUOTE_MINI 

MAL (only fields with special characters like the delimiter), csv. QUOTE _NONNUMERIC, and 
csv. QUOTE_NONE (no quoting). See Python's documentation for full details. Defaults to 
QUOTE_MINIMAL. 

skiplnltialspace Ignore whitespace after each delimiter; default is False. 

doublequote How to handle quoting character inside a field; if True, it is doubled (see online documentation 

for full detail and behavior). 


escapechar String to escape the delimiter if quoting is set to csv. QUOTE NONE; disabled by default. 



For files with more complicated or fixed multicharacter delimiters, 
you will not be able to use the csv module. In those cases, you’ll 
have to do the line splitting and other cleanup using string’s split 
method or the regular expression method re . split. 


To write delimited files manually, you can use csv.writer. It accepts an open, writa¬ 
ble file object and the same dialect and format options as csv. reader: 

with open( 'mydata.csv 1 , 'w') as f: 

writer = csv.wrlter(f , dlalect=my_dlalect) 
writer.wrlterow(( 'one' , 'two', 'three')) 
writer.wrlterow(( '1' , '2', '3')) 
writer.wrlterow(( '4' , '5', '6')) 
writer.wrlterow(( '7' , '8', '9')) 

JSON Data 

JSON (short for JavaScript Object Notation) has become one of the standard formats 
for sending data by HTTP request between web browsers and other applications. It is 
a much more free-form data format than a tabular text form like CSV. Here is an 
example: 

obj = .. 

{"name": "Wes", 

"places_ltved": [''United States", "Spain", "Germany"], 

"pet": null, 

"siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]}, 

{"name": "Katie", "age": 38, 

"pets": ["Sixes", "Stache", "Cisco"]}] 

} 

JSON is very nearly valid Python code with the exception of its null value null and 
some other nuances (such as disallowing trailing commas at the end of lists). The 
basic types are objects (diets), arrays (lists), strings, numbers, booleans, and nulls. All 
of the keys in an object must be strings. There are several Python libraries for reading 
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and writing JSON data. I’ll use json here, as it is built into the Python standard 
library. To convert a JSON string to Python form, use json. loads: 

In [62]: import 

In [63]: result = json.loads(obj ) 

In [64]: result 
0ut[64] : 

{ 'name' : 'Wes' , 

1 pet' : None, 

1 places_ltved' : ['United States', 'Spain', 'Germany'], 

'siblings': [{'age': 30, 'name': 'Scott', 'pets': ['Zeus', 'Zuko']}, 

{'age': 38, 'name': 'Katie', 'pets': ['Sixes', 'Stache', 'Cisco']}]} 

json. dumps, on the other hand, converts a Python object back to JSON: 

In [65]: asjson = json.dumps(result) 

How you convert a JSON object or list of objects to a DataFrame or some other data 
structure for analysis will be up to you. Conveniently, you can pass a list of diets 
(which were previously JSON objects) to the DataFrame constructor and select a sub¬ 
set of the data fields: 

In [66]: siblings = pd. DataFrame(result[ 1 siblings '], columns= [’ name' , 'age']) 

In [67]: siblings 
0ut[67] : 

name age 
0 Scott 30 
1 Katie 38 

The pandas.read_json can automatically convert JSON datasets in specific arrange¬ 
ments into a Series or DataFrame. For example: 

In [68]: !cat examples/example . json 


[{"a" 

1, 

"b" 

2, 

"c" : 3}, 

{"a" 

4, 

"b" 

5, 

"c" : 6}, 

{"a" 

7, 

"b" 

8, 

"c": 9}] 


The default options for pandas. read_json assume that each object in the JSON array 
is a row in the table: 

In [69]: data = pd.read_json( 'examples/example.json' ) 

In [70]: data 
Out[70] : 

a b c 
0 12 3 

14 5 6 

2 7 8 9 
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For an extended example of reading and manipulating JSON data (including nested 
records), see the USDA Food Database example in Chapter 7. 

If you need to export data from pandas to JSON, one way is to use the to_json meth¬ 
ods on Series and DataFrame: 

In [71]: print(data. to_json( )) 

{"a" : {"0" : 1, "1" :4, "2" :7}, "b" : {"0":2, "1" : 5, "2" :8}, "c" : {"0":3, "1" :6, "2" :9}} 

In [72]: print(data.to_json(orient=' records' )) 

[{"a" : 1, "b" :2, "c" : 3},{"a" :4, "b" : 5, "c" :6}, {"a" : 7, "b" :8, "c" :9}] 

XML and HTML: Web Scraping 

Python has many libraries for reading and writing data in the ubiquitous HTML and 
XML formats. Examples include lxml, Beautiful Soup, and html5lib. While lxml is 
comparatively much faster in general, the other libraries can better handle malformed 
HTML or XML files. 

pandas has a built-in function, read_html, which uses libraries like lxml and Beauti¬ 
ful Soup to automatically parse tables out of HTML files as DataFrame objects. To 
show how this works, I downloaded an HTML file (used in the pandas documenta¬ 
tion) from the United States FDIC government agency showing bank failures. 1 First, 
you must install some additional libraries used by read_html: 

conda install lxml 

pip install beautifulsoup4 html5lib 

If you are not using conda, pip install lxml will likely also work. 

The pandas. read_html function has a number of options, but by default it searches 
for and attempts to parse all tabular data contained within <table> tags. The result is 
a list of DataFrame objects: 

In [73]: tables = pd.read_html( 'examples/fdic_failed_bank_list.html' ) 

In [74]: len(tables) 

0ut[74] : 1 

In [75]: failures = tables[0] 


In [76]: failures.head() 
0ut[76] : 



Bank Name 

City 

ST 

CERT 

0 

Allied Bank 

Mulberry 

AR 

91 

1 

The Woodbury Banking Company 

Woodbury 

GA 

11297 

2 

First Cornerstone Bank 

King of Prussia 

PA 

35312 


1 For the full list, see https://www.fdic.gov/bank/individual/failed/banklist.html. 
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3 Trust Company Bank 

4 North Milwaukee State Bank 

Acquiring Institution 
0 Today's Bank 

1 United Bank 

2 First-Citizens Bank & Trust Company 

3 The Bank of Fayette County 

4 First-Citizens Bank & Trust Company 


Memphis TN 9956 
Milwaukee WI 20364 
Closing Date 
September 23, 2016 
August 19, 2016 
May 6, 2016 
April 29, 2016 
March 11, 2016 


Updated Date 
November 17, 2016 
November 17, 
September 6, 
September 6, 

Tune 16, 


2016 

2016 

2016 

2016 


Because failures has many columns, pandas inserts a line break character \. 


As you will learn in later chapters, from here we could proceed to do some data 
cleaning and analysis, like computing the number of bank failures by year: 

In [77]: close_timestamps = pd.to_datetime(failures[' Closing Date']) 

In [78]: close_timestamps.dt.year.value_counts() 


0ut[78] 


2010 

157 

2009 

140 

2011 

92 

2012 

51 

2008 

25 

2004 

4 

2001 

4 

2007 

3 

2003 

3 

2000 

2 

Name: Closing 


Parsing XML with Ixml.objectify 

XML (extensible Markup Language) is another common structured data format sup¬ 
porting hierarchical, nested data with metadata. The book you are currently reading 
was actually created from a series of large XML documents. 

Earlier, I showed the pandas. read_html function, which uses either lxml or Beautiful 
Soup under the hood to parse data from HTML. XML and HTML are structurally 
similar, but XML is more general. Here, I will show an example of how to use lxml to 
parse data from a more general XML format. 

The New York Metropolitan Transportation Authority (MTA) publishes a number of 
data series about its bus and train services. Here we’ll look at the performance data, 
which is contained in a set of XML files. Each train or bus service has a different file 
(like Performance_MNR.xml for the Metro-North Railroad) containing monthly data 
as a series of XML records that look like this: 

<INDICATOR> 

<INDICAT0R_SEQ>373889</INDICAT0R_SEQ> 

<PARENT_SEQx/PARENT_SEQ> 
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<AGENCY_NAME>Met ro-North Railroad</AGENCY_NAME> 

<INDICATOR_NAME>Escalator Availability</INDICATOR_NAME> 

<DESCRIPTION>Percent of the time that escalators are operational 
systemwide. The availability rate is based on physical observations performed 
the morning of regular business days only. This is a new indicator the agency 
began reporting in 2009.</DESCRIPTION> 

<PERIOD_YEAR>2011</PERIOD_YEAR> 

<PERI0D_M0NTH>12</PERI0D_M0NTH> 

<CATEGORY>Ser vice Indicators</CATEGORY> 

<FREQUENCY>M</FREQUENCY> 

<DESIRED_CHANGE>U</DESIRED_CHANGE> 

<INDICATOR_UNIT >%</INDICATOR_UNIT > 

<DECIMAL_PLACES>1</DECIMAL_PLACES> 

<YTD_TARGET>97 . 00</YTD_TARGET> 

<YTD_ACTUALx/YTD_ACTUAL> 

<M0NTHLY_TARGET>97 . 00</MONTHLY_TARGET> 

<MONTHLY_ACTUALx/MONTHLY_ACTUAL> 

</INDICATOR> 

Using Ixml. objectify, we parse the file and get a reference to the root node of the 
XML file with get root: 

from import objectify 

path = 'examples/mta_perf/Performance_MNR.xml' 

parsed = objectify.parse(open(path)) 

root = parsed,getroot( ) 

root.INDICATOR returns a generator yielding each <INDICATOR> XML element. For 
each record, we can populate a diet of tag names (like YTD_ACTUAL) to data values 
(excluding a few tags): 

data = [] 

skip_fields = [ 'PARENT_SEQ 1 , 'INDICATOR_SEQ' , 

'DESIRED_CHANGE 1 , ' DECIMAL_PLACES' ] 

for elt in root.INDICATOR: 
el_data = {} 

for child in elt.getchildrenQ : 
if child.tag in skip_fields: 
continue 

el_data[child.tag] = child.pyval 
data.append(el_data) 

Lastly, convert this list of diets into a DataFrame: 

In [81]: perf = pd.DataFrame(data) 

In [82]: perf.headQ 

0ut[82] : 

Empty DataFrame 
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Columns: [] 

Index: [] 

XML data can get much more complicated than this example. Each tag can have 
metadata, too. Consider an HTML link tag, which is also valid XML: 

from import StringlO 

tag = '<a href="http://www.google.com">Google</a>' 
root = objectify.parse(StringIO(tag) ) .getroot() 

You can now access any of the fields (like h ref) in the tag or the link text: 

In [84]: root 

0ut[84] : <Element a at 0x7f6bl5817748> 

In [85]: root.get(' href' ) 

0ut[85]: 'http://www.google.com' 

In [86]: root.text 
Out[86]: 'Google' 

6.2 Binary Data Formats 

One of the easiest ways to store data (also known as serialization) efficiently in binary 
format is using Python’s built-in pickle serialization, pandas objects all have a 
to_pickle method that writes the data to disk in pickle format: 

In [87]: frame = pd.read_csv( 'examples/exl.csv' ) 

In [88]: frame 
Out[88] : 

abed message 
01234 hello 
15678 world 
2 9 10 11 12 foo 

In [89]: frame.to_pickle( 'examples/frame_pickle' ) 

You can read any “pickled” object stored in a file by using the built-in pickle directly, 
or even more conveniently using pandas. read_pickle: 

In [90]: pd.read_pickle( 'examples/frame_pickle' ) 

Out[90] : 

abed message 
01234 hello 
15678 world 
2 9 10 11 12 foo 


6.2 Binary Data Formats | 183 




pickle is only recommended as a short-term storage format. The 
problem is that it is hard to guarantee that the format will be stable 
over time; an object pickled today may not unpickle with a later 
version of a library. We have tried to maintain backward compati¬ 
bility when possible, but at some point in the future it may be nec¬ 
essary to “break” the pickle format. 


pandas has built-in support for two more binary data formats: HDF5 and Message- 
Pack. I will give some HDF5 examples in the next section, but I encourage you to 
explore different file formats to see how fast they are and how well they work for your 
analysis. Some other storage formats for pandas or NumPy data include: 

bcolz 

A compressable column-oriented binary format based on the Blosc compression 
library. 

Feather 

A cross-language column-oriented file format I designed with the R program¬ 
ming community’s Hadley Wickham. Feather uses the Apache Arrow columnar 
memory format. 


Using HDF5 Format 

HDF5 is a well-regarded file format intended for storing large quantities of scientific 
array data. It is available as a C library, and it has interfaces available in many other 
languages, including Java, Julia, MATLAB, and Python. The “HDF” in HDF5 stands 
for hierarchical data format. Each HDF5 file can store multiple datasets and support¬ 
ing metadata. Compared with simpler formats, HDF5 supports on-the-fly compres¬ 
sion with a variety of compression modes, enabling data with repeated patterns to be 
stored more efficiently. HDF5 can be a good choice for working with very large data¬ 
sets that don’t fit into memory, as you can efficiently read and write small sections of 
much larger arrays. 

While it’s possible to directly access HDF5 files using either the PyTables or h5py 
libraries, pandas provides a high-level interface that simplifies storing Series and 
DataFrame object. The HDFStore class works like a diet and handles the low-level 
details: 


In 

[92] 

In 

[93] 

In 

[94] 

In 

[95] 

In 

[96] 


frame = pd.DataFrame({ 'a' : np.random. randn(100)}) 

store = pd.HDFStore( 'mydata.h5 1 ) 

store[ 'objl' ] = frame 

store[ 'objl_col' ] = framef'a'] 

store 
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Out [96]: 

<class ' pandas . io.pytables.HDFStore 1 > 


File path: mydata.h5 

/objl 


frame 


(shape->[100,l] ) 


/objl_col 


senes 


(shape->[100] ) 


/obj2 
[index] ) 
/obj3 
[index] ) 


frame_table (typ->appendable,nrows- >100, ncols->l, indexers -> 


frame_table (typ->appendable,nrows- >100, ncols->l, indexers -> 


Objects contained in the HDF5 file can then be retrieved with the same dict-like API: 

In [97]: store[ 'objl'] 

0ut[97] : 


a 


0 -0.204708 

1 0.478943 

2 -0.519439 

3 -0.555730 

4 1.965781 

95 0.795253 

96 0.118110 

97 -0.748532 

98 0.584970 

99 0.152677 

[100 rows x 1 columns] 

HDFStore supports two storage schemas, 1 fixed ' and 1 table '. The latter is generally 
slower, but it supports query operations using a special syntax: 

In [98]: store.put( 'obj2' , frame, format= 'table' ) 

In [99]: store.select( 'obj2' , where=[ 'index >= 10 and index <= 15']) 

0ut[99] : 


a 


10 1.007189 

11 -1.296221 

12 0.274992 

13 0.228913 

14 1.352917 

15 0.886429 

In [100]: store. close() 

The put is an explicit version of the store[' obj2' ] = frame method but allows us to 
set other options like the storage format. 

The pandas. read_hdf function gives you a shortcut to these tools: 

In [101]: frame.to_hdf( 'mydata.h5' , 'obj3', format= 'table' ) 
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In [102]: pd.read_hdfCnydata.h5', ' ob j 3' , where=[ ' index < 5']) 
Out[102] : 

a 

0 -0.204708 

1 0.478943 

2 -0.519439 

3 -0.555730 

4 1.965781 



If you are processing data that is stored on remote servers, like 
Amazon S3 or HDFS, using a different binary format designed for 
distributed storage like Apache Parquet may be more suitable. 
Python for Parquet and other such storage formats is still develop¬ 
ing, so I do not write about them in this book. 


If you work with large quantities of data locally, I would encourage you to explore 
PyTables and h5py to see how they can suit your needs. Since many data analysis 
problems are I/O-bound (rather than CPU-bound), using a tool like HDF5 can mas¬ 
sively accelerate your applications. 



HDF5 is not a database. It is best suited for write-once, read-many 
datasets. While data can be added to a file at any time, if multiple 
writers do so simultaneously, the file can become corrupted. 


Reading Microsoft Excel Files 

pandas also supports reading tabular data stored in Excel 2003 (and higher) files 
using either the ExcelFile class or pandas.read_excel function. Internally these 
tools use the add-on packages xlrd and openpyxl to read XLS and XLSX files, respec¬ 
tively You may need to install these manually with pip or conda. 

To use ExcelFile, create an instance by passing a path to an xls or xlsx file: 

In [104]: xlsx = pd.ExcelFlle( 'examples/exl.xlsx' ) 

Data stored in a sheet can then be read into DataFrame with parse: 

In [105]: pd.read_excel(xlsx, 'Sheetl') 

Out[105] : 

abed message 
01234 hello 
15678 world 
2 9 10 11 12 foo 

If you are reading multiple sheets in a file, then it is faster to create the ExcelFile, 
but you can also simply pass the filename to pandas. read_excel: 


186 | Chapter 6: Data Loading, Storage, and File Formats 







In [106]: frame = pd.read_excel( 1 examples/exl.xlsx' , 'Sheetl') 

In [107]: frame 
Out[107]: 

abed message 
01234 hello 
15678 world 
2 9 10 11 12 foo 

To write pandas data to Excel format, you must first create an ExcelWriter, then 
write data to it using pandas objects’ to_excel method: 

In [108]: writer = pd.ExcelWriter( 'examples/ex2.xlsx' ) 

In [109]: frame.to_excel(writer, 'Sheetl') 

In [110]: writer.save( ) 

You can also pass a file path to to_excel and avoid the ExcelWriter: 

In [111]: frame.to_excel( 'examples/ex2.xlsx' ) 

6.3 Interacting with Web APIs 

Many websites have public APIs providing data feeds via JSON or some other format. 
There are a number of ways to access these APIs from Python; one easy-to-use 
method that I recommend is the requests package. 

To find the last 30 GitHub issues for pandas on GitHub, we can make a GET HTTP 
request using the add-on requests library: 

In [113]: import 

In [114]: url = 'https://api.github.com/repos/pandas-dev/pandas/issues' 

In [115]: resp = requests.get(url) 

In [116]: resp 
0ut[116]: <Response [200]> 

The Response object’s json method will return a dictionary containing JSON parsed 
into native Python objects: 

In [117]: data = resp.jsonQ 
In [118]: data[0] [' title' ] 

0ut[118]: 'Period does not round down for frequencies less that 1 hour' 

Each element in data is a dictionary containing all of the data found on a GitHub 
issue page (except for the comments). We can pass data directly to DataFrame and 
extract fields of interest: 
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In [119]: Issues = pd.DataFrame(data, columns=[' number' , 'title', 
.: ' labels' , 'state 1 ]) 


In [120]: issues 
Out[120]: 


0 

1 

2 

3 

4 


number title 

17666 Period does not round down for frequencies les... 
17665 DOC: improve docstring of function where 

17664 COMPAT : skip 32-bit test on int repr 

17662 implement Delegator class 

17654 BUG: Fix series rename called with str alterin... 


\ 


25 

26 

27 

28 
29 


17603 BUG: Correctly localize naive datetime strings... 
17599 core.dtypes.generic --> cython 
17596 Merge cdate_range functionality into bdate_range 
17587 Time Grouper bug fix when applied for list gro... 
17583 BUG: fix tz-aware Datetimelndex + Timedeltalnd... 





labels 

state 

0 



[] 

open 

1 

[{'id' 

134699, 'url' : 

'https://apt.github.com... 

open 

2 

[{'id' 

563047854, 'url 

': 'https://api.github.... 

open 

3 



[] 

open 

4 

[{'id' 

76811, ’url': 

https://api.github.com/... 

open 

25 

[{'id' 

76811, 'url': 

https://api.github.com/... 

open 

26 

[{'id' 

49094459, ’url 

: 'https://api.github.c... 

open 

27 

[{'id' 

35818298, ’url 

: 'https://api.github.c... 

open 

28 

[{'id' 

233160, 'url': 

'https://api.github.com... 

open 

29 

[{'id' 

76811, 'url': 

https://api.github.com/... 

open 

[30 

rows x 

4 columns] 




With a bit of elbow grease, you can create some higher-level interfaces to common 
web APIs that return DataFrame objects for easy analysis. 


6.4 Interacting with Databases 

In a business setting, most data may not be stored in text or Excel files. SQL-based 
relational databases (such as SQL Server, PostgreSQL, and MySQL) are in wide use, 
and many alternative databases have become quite popular. The choice of database is 
usually dependent on the performance, data integrity, and scalability needs of an 
application. 

Loading data from SQL into a DataFrame is fairly straightforward, and pandas has 
some functions to simplify the process. As an example, I’ll create a SQLite database 
using Python’s built-in sqlite3 driver: 

In [121]: import 

In [122]: query = """ 

.: CREATE TABLE test 
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(a VARCHAR(20), b VARCHAR(20), 
c REAL, d INTEGER 

);. 


In [123]: con = sqlite3.connect( 'mydata.sqlite' ) 

In [124]: con.execute(query) 

0ut[124]: <sqllte3.Cursor at 0x7f6bl2a50fl0> 

In [125]: con.commit( ) 

Then, insert a few rows of data: 

In [126]: data = [('Atlanta', 'Georgia', 1.25, 6), 

.: ('Tallahassee', 'Florida', 2.6, 3), 

.: ('Sacramento', 'California', 1.7, 5)] 

In [127]: stmt = "INSERT INTO test VALUES(?, ?, ?, ?)" 

In [128]: con.executemany(stmt, data) 

0ut[128]: <sqlite3.Cursor at 0x7f6bl5c66ce0> 

In [129]: con.commitQ 

Most Python SQL drivers (PyODBC, psycopg2, MySQLdb, pymssql, etc.) return a list 
of tuples when selecting data from a table: 

In [130]: cursor = con.execute( 'select * from test') 

In [131]: rows = cursor.fetchallQ 

In [132]: rows 
0ut[132] : 

[('Atlanta', 'Georgia', 1.25, 6), 

('Tallahassee', 'Florida', 2.6, 3), 

('Sacramento', 'California', 1.7, 5)] 

You can pass the list of tuples to the DataFrame constructor, but you also need the 
column names, contained in the cursor’s description attribute: 

In [133]: cursor.description 
0ut[133] : 

(('a'. None, None, None, None, None, None), 

( 'b' , None, None, None, None, None, None), 

('c'. None, None, None, None, None, None), 

( 'd' , None, None, None, None, None, None)) 

In [134]: pd.DataFrame(rows, columns=[x[0] for x in cursor.description]) 

0ut[134] : 



a 

b 

c 

d 

0 

Atlanta 

Georgia 

1.25 

6 

1 

Tallahassee 

Florida 

2.60 

3 

2 

Sacramento 

California 

1.70 

5 
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This is quite a bit of munging that you’d rather not repeat each time you query the 
database. The SQLAlchemy project is a popular Python SQL toolkit that abstracts 
away many of the common differences between SQL databases, pandas has a 
read_sql function that enables you to read data easily from a general SQLAlchemy 
connection. Here, we’ll connect to the same SQLite database with SQLAlchemy and 
read data from the table created before: 

In [135]: import as 

In [136]: db = sqta,create_engine( 'sqllte:///mydata.sqlite' ) 

In [137]: pd.read_sql( 'select * from test', db) 

0ut[137]: 



a 

b 

c 

d 

0 

Atlanta 

Georgia 

1.25 

6 

1 

Tallahassee 

Florida 

2.60 

3 

2 

Sacramento 

California 

1.70 

5 


6.5 Conclusion 

Getting access to data is frequently the first step in the data analysis process. We have 
looked at a number of useful tools in this chapter that should help you get started. In 
the upcoming chapters we will dig deeper into data wrangling, data visualization, 
time series analysis, and other topics. 
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CHAPTER 7 


Data Cleaning and Preparation 


During the course of doing data analysis and modeling, a significant amount of time 
is spent on data preparation: loading, cleaning, transforming, and rearranging. Such 
tasks are often reported to take up 80% or more of an analyst’s time. Sometimes the 
way that data is stored in files or databases is not in the right format for a particular 
task. Many researchers choose to do ad hoc processing of data from one form to 
another using a general-purpose programming language, like Python, Perl, R, or Java, 
or Unix text-processing tools like sed or awk. Fortunately, pandas, along with the 
built-in Python language features, provides you with a high-level, flexible, and fast set 
of tools to enable you to manipulate data into the right form. 

If you identify a type of data manipulation that isn’t anywhere in this book or else¬ 
where in the pandas library, feel free to share your use case on one of the Python 
mailing lists or on the pandas GitHub site. Indeed, much of the design and imple¬ 
mentation of pandas has been driven by the needs of real-world applications. 

In this chapter I discuss tools for missing data, duplicate data, string manipulation, 
and some other analytical data transformations. In the next chapter, I focus on com¬ 
bining and rearranging datasets in various ways. 

7.1 Handling Missing Data 

Missing data occurs commonly in many data analysis applications. One of the goals 
of pandas is to make working with missing data as painless as possible. For example, 
all of the descriptive statistics on pandas objects exclude missing data by default. 

The way that missing data is represented in pandas objects is somewhat imperfect, 
but it is functional for a lot of users. For numeric data, pandas uses the floating-point 
value NaN (Not a Number) to represent missing data. We call this a sentinel value that 
can be easily detected: 
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In [16]: string_data = pd.Seriesf [ 'aardvark' , 'artichoke', np.nan, 'avocado']) 


In [11]: string_data 
Out[ll] : 

0 aardvark 

1 artichoke 

2 NaN 

3 avocado 
dtype: object 

In [12]: string_data.isnull() 

0ut[12] : 

0 False 

1 False 

2 T rue 

3 False 
dtype: bool 

In pandas, we’ve adopted a convention used in the R programming language by refer¬ 
ring to missing data as NA, which stands for not available. In statistics applications, 
NA data may either be data that does not exist or that exists but was not observed 
(through problems with data collection, for example). When cleaning up data for 
analysis, it is often important to do analysis on the missing data itself to identify data 
collection problems or potential biases in the data caused by missing data. 

The built-in Python None value is also treated as NA in object arrays: 

In [13]: strlng_data[0] = None 

In [14]: string_data.isnull() 

0ut[14] : 

0 T rue 

1 False 

2 T rue 

3 False 
dtype: bool 

There is work ongoing in the pandas project to improve the internal details of how 
missing data is handled, but the user API functions, like pandas.isnull, abstract 
away many of the annoying details. See Table 7-1 for a list of some functions related 
to missing data handling. 

Table 7-1. NA handling methods 


Argument Description 


dropna Filter axis labels based on whether values for each label have missing data, with varying thresholds for how 
much missing data to tolerate. 

fillna Fill in missing data with some value or using an interpolation method such as ' ffill' or' bfill'. 
isnull Return boolean values indicating which values are missing/NA. 
notnull Negation of isnull. 
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Filtering Out Missing Data 

There are a few ways to filter out missing data. While you always have the option to 
do it by hand using pandas. isnull and boolean indexing, the dropna can be helpful. 
On a Series, it returns the Series with only the non-null data and index values: 

In [15]: from inport nan as NA 

In [16]: data = pd.Series([l, NA, 3.5, NA, 7]) 

In [17]: data.dropnaQ 
0ut[17] : 

0 1.0 
2 3.5 

4 7.0 

dtype: float64 

This is equivalent to: 

In [18]: data[data.notnull()] 

0ut[18] : 

0 1.0 
2 3.5 

4 7.0 

dtype: float64 

With DataFrame objects, things are a bit more complex. You may want to drop rows 
or columns that are all NA or only those containing any NAs. dropna by default drops 
any row containing a missing value: 

In [19]: data = pd.DataFrane([[l., 6.5, 3.], [1., NA, NA], 

_: [NA, NA, NA], [NA, 6.5, 3.]]) 

In [20]: cleaned = data.dropna( ) 

In [21]: data 
0ut[21] : 

0 12 
0 1.0 6.5 3.0 

1 1.0 NaN NaN 

2 NaN NaN NaN 

3 NaN 6.5 3.0 

In [22]: cleaned 
0ut[22] : 

0 12 
0 1.0 6.5 3.0 

Passing how=' all' will only drop rows that are all NA: 

In [23]: data.dropna(how='all' ) 

0ut[23] : 

0 12 
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0 

1.0 

6.5 

3.0 

1 

1.0 

NaN 

NaN 

3 

NaN 

6.5 

3.0 


To drop columns in the same way, pass axis=l: 
In [24]: data[4] = NA 
In [25]: data 


0ut[25] 

0 

1 

2 

4 

0 

1.0 

6.5 

3.0 

NaN 

1 

1.0 

NaN 

NaN 

NaN 

2 

NaN 

NaN 

NaN 

NaN 

3 

NaN 

6.5 

3.0 

NaN 


In [26]: data,dropna(axis=l, how='all') 


0ut[26] 

0 

1 

2 

0 

1.0 

6.5 

3.0 

1 

1.0 

NaN 

NaN 

2 

NaN 

NaN 

NaN 

3 

NaN 

6.5 

3.0 


A related way to filter out DataFrame rows tends to concern time series data. Suppose 
you want to keep only rows containing a certain number of observations. You can 
indicate this with the thresh argument: 

In [27]: df = pd.DataFrane(np.random.randn(7, 3)) 

In [28]: df.iloc[:4, 1] = NA 


In [29]: df.iloc[:2, 2] = NA 


In [30]: df 
Out[30] : 



0 

1 

2 

0 

-0.204708 

NaN 

NaN 

1 

-0.555730 

NaN 

NaN 

2 

0.092908 

NaN 

0.769023 

3 

1.246435 

NaN 

-1.296221 

4 

0.274992 

0.228913 

1.352917 

5 

0.886429 

-2.001637 

-0.371843 

6 

1.669025 

-0.438570 

-0.539741 

In 

[31]: df. 

dropna() 



0ut[31] : 

0 12 

4 0.274992 0.228913 1.352917 

5 0.886429 -2.001637 -0.371843 

6 1.669025 -0.438570 -0.539741 


In [32]: df.dropna(thresh=2) 
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Out [32]: 


2 

3 

4 

5 

6 


0 

0.092908 

1.246435 

0.274992 

0.886429 

1.669025 


1 

NaN 

NaN 

0.228913 

-2.001637 

-0.438570 


2 

0.769023 

-1.296221 

1.352917 

-0.371843 

-0.539741 


Filling In Missing Data 

Rather than filtering out missing data (and potentially discarding other data along 
with it), you may want to fill in the “holes” in any number of ways. For most pur¬ 
poses, the fillna method is the workhorse function to use. Calling fillna with a 
constant replaces missing values with that value: 


In [33]: df.fitlna(0) 
0ut[33] : 



0 

1 

2 

0 

-0.204708 

0.000000 

0.000000 

1 

-0.555730 

0.000000 

0.000000 

2 

0.092908 

0.000000 

0.769023 

3 

1.246435 

0.000000 

-1.296221 

4 

0.274992 

0.228913 

1.352917 

5 

0.886429 

-2.001637 

-0.371843 

6 

1.669025 

-0.438570 

-0.539741 


Calling fillna with a diet, you can use a different fill value for each column: 


In 

[34]: df .fillna({l: 

0.5, 2: 1 

0ut[34] : 




0 

1 

2 

0 

-0.204708 

0.500000 

0.000000 

1 

-0.555730 

0.500000 

0.000000 

2 

0.092908 

0.500000 

0.769023 

3 

1.246435 

0.500000 

-1.296221 

4 

0.274992 

0.228913 

1.352917 

5 

0.886429 

-2.001637 

-0.371843 

6 

1.669025 

-0.438570 

-0.539741 


fillna returns a new object, but you can modify the existing object in-place: 

In [35]: _ = df.ftllna(0, inplace=True) 


In [36]: df 
0ut[36] : 

0 1 
0 -0.204708 0.000000 

1 -0.555730 0.000000 

2 0.092908 0.000000 

3 1.246435 0.000000 

4 0.274992 0.228913 

5 0.886429 -2.001637 

6 1.669025 -0.438570 


2 

0.000000 

0.000000 

0.769023 

-1.296221 

1.352917 

-0.371843 

-0.539741 
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The same interpolation methods available for reindexing can be used with f Ulna: 

In [37]: df = pd.DataFrane(np.random.randn(6, 3)) 

In [38]: df.iloc[2:, 1] = NA 
In [39]: df.iloc[4:, 2] = NA 


In [40]: df 
Out[40] : 


0 

1 

2 

0 0.476985 

3.248944 

-1.021228 

1 -0.577087 

0.124121 

0.302614 

2 0.523772 

NaN 

1.343810 

3 -0.713544 

NaN 

-2.370232 

4 -1.860761 

NaN 

NaN 

5 -1.265934 

NaN 

NaN 


In [41]: df.fillna(method=' ffilt 1 ) 
0ut[41] : 

0 12 
0 0.476985 3.248944 -1.021228 

1 -0.577087 0.124121 0.302614 

2 0.523772 0.124121 1.343810 

3 -0.713544 0.124121 -2.370232 

4 -1.860761 0.124121 -2.370232 

5 -1.265934 0.124121 -2.370232 


In [42]: df.fillna(method= 'ffilt' , linit=2) 
0ut[42] : 


0 

0 0.476985 

1 -0.577087 

2 0.523772 

3 -0.713544 

4 -1.860761 

5 -1.265934 


1 2 
3.248944 -1.021228 
0.124121 0.302614 

0.124121 1.343810 

0.124121 -2.370232 
NaN -2.370232 
NaN -2.370232 


With fillna you can do lots of other things with a little creativity. For example, you 
might pass the mean or median value of a Series: 

In [43]: data = pd.Series( [1., NA, 3.5, NA, 7]) 


In [44]: data.fillna(data.meanQ) 
0ut[44] : 

0 1.000000 

1 3.833333 

2 3.500000 

3 3.833333 

4 7.000000 
dtype: float64 

See Table 7-2 for a reference on fillna. 
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Table 7-2. fillna function arguments 


Argument Description 


value Scalar value or dict-like object to use to fill missing values 

method Interpolation; by default' ffi.ll' if function called with no other arguments 

axis Axis to fill on; default axis=0 

inplace Modify the calling object without producing a copy 

limit For forward and backward filling, maximum number of consecutive periods to fill 


7.2 Data Transformation 

So far in this chapter we’ve been concerned with rearranging data. Filtering, cleaning, 
and other transformations are another class of important operations. 

Removing Duplicates 

Duplicate rows may be found in a DataFrame for any number of reasons. Here is an 
example: 

In [45]: data = pd.DataFrame({ 'kl' : ['one', 'two'] * 3 + ['two'], 

....: ' k2' : [1, 1, 2, 3, 3, 4, 4]}) 


In [46]: data 
0ut[46] : 

kl k2 
0 one 1 

1 two 1 

2 one 2 

3 two 3 

4 one 3 

5 two 4 

6 two 4 

The DataFrame method duplicated returns a boolean Series indicating whether each 
row is a duplicate (has been observed in a previous row) or not: 

In [47]: data.dupllcatedQ 
0ut[47] : 

0 False 

1 False 

2 False 

3 False 

4 False 

5 False 

6 T rue 
dtype: bool 

Relatedly, drop_duplicates returns a DataFrame where the duplicated array is 
False: 
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In [48]: data.drop_duplicates() 

0ut[48] : 

kl k2 
0 one 1 

1 two 1 

2 one 2 

3 two 3 

4 one 3 

5 two 4 

Both of these methods by default consider all of the columns; alternatively, you can 
specify any subset of them to detect duplicates. Suppose we had an additional column 
of values and wanted to filter duplicates only based on the 1 kl' column: 

In [49]: dataf'vl'] = range(7) 

In [50]: data,drop_duplicates( [’ kl' ]) 

Out[50] : 

kl k2 vl 

0 one 1 0 

1 two 1 1 

duplicated and drop_duplicates by default keep the first observed value combina¬ 
tion. Passing keep=' last' will return the last one: 

In [51]: data,drop_duplicates( [ 1 kl' , 'k2'] , keep='last') 

0ut[51] : 

kl k2 vl 

0 one 1 0 

1 two 1 1 

2 one 2 2 

3 two 3 3 

4 one 3 4 

6 two 4 6 

Transforming Data Using a Function or Mapping 

For many datasets, you may wish to perform some transformation based on the val¬ 
ues in an array, Series, or column in a DataFrame. Consider the following hypotheti¬ 
cal data collected about various kinds of meat: 

In [52]: data = pd.DataFrane({ 1 food’ : ['bacon', 'putted pork', 'bacon', 

....: 'Pastrami', 'corned beef', 'Bacon', 

....: 'pastrami', 'honey ham', 'nova lox'], 

_: 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]}) 

In [53]: data 
0ut[53] : 

food ounces 

0 bacon 4.0 

1 pulled pork 3.0 

2 bacon 12.0 
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3 

Pastrami 

6.0 

4 

corned beef 

7.5 

5 

Bacon 

8.0 

6 

pastrami 

3.0 

7 

honey ham 

5.0 

8 

nova lox 

6.0 


Suppose you wanted to add a column indicating the type of animal that each food 
came from. Let’s write down a mapping of each distinct meat type to the kind of 
animal: 

meat_to_animal = { 

'bacon' : 1 pig' , 

'pulled pork' : 'pig 1 , 

'pastrami' : 'cow' , 

'corned beef' : 'cow' , 

'honey ham' : 'pig' , 

'nova lox' : 'salmon' 

} 

The map method on a Series accepts a function or dict-like object containing a map¬ 
ping, but here we have a small problem in that some of the meats are capitalized and 
others are not. Thus, we need to convert each value to lowercase using the str. lower 
Series method: 

In [55]: lowercased = data[' food ']. str.lower( ) 

In [56]: lowercased 
0ut[56] : 

0 bacon 

1 pulled pork 

2 bacon 

3 pastrami 

4 corned beef 

5 bacon 

6 pastrami 

7 honey ham 

8 nova lox 

Name: food, dtype: object 

In [57]: data[ 'animal' ] = lowercased.map(meat_to_animal) 


In [58]: data 
0ut[58] : 



food 

ounces 

animal 

0 

bacon 

4.0 

pig 

1 

pulled pork 

3.0 

pig 

2 

bacon 

12.0 

pig 

3 

Pastrami 

6.0 

cow 

4 

corned beef 

7.5 

cow 

5 

Bacon 

8.0 

pig 

6 

pastrami 

3.0 

cow 
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7 honey ham 5.0 pig 

8 nova lox 6.0 salmon 


We could also have passed a function that does all the work: 


In [59]: data[ 'food 1 ] ,map(lambda x: meat_to_anlmal[x.lower()]) 
0ut[59] : 


0 pig 

1 pig 

pig 


0 

1 

2 

3 

4 

5 

6 
7 


3 cow 


cow 


prg 

cow 


pig 


8 salmon 

Name: food, dtype: object 

Using map is a convenient way to perform element-wise transformations and other 
data cleaning-related operations. 


Replacing Values 


Filling in missing data with the fillna method is a special case of more general value 
replacement. As you’ve already seen, nap can be used to modify a subset of values in 
an object but replace provides a simpler and more flexible way to do so. Let’s con¬ 
sider this Series: 

In [60]: data = pd.Series([l. , -999., 2., -999., -1000., 3.]) 

In [61]: data 
0ut[61] : 

0 1.0 

1 -999.0 

2 2.0 

3 -999.0 

4 -1000.0 

5 3.0 
dtype: float64 

The -999 values might be sentinel values for missing data. To replace these with NA 
values that pandas understands, we can use replace, producing a new Series (unless 
you pass inplace=True): 

In [62]: data. replace(-999, np.nan) 

0ut[62] : 

0 1.0 

1 NaN 

2 2.0 

3 NaN 

4 -1000.0 
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5 3.0 

dtype: float64 

If you want to replace multiple values at once, you instead pass a list and then the 
substitute value: 

In [63]: data.reptace( [ -999, -1000], np.nan) 

0ut[63] : 

0 1.0 

1 NaN 

2 2.0 

3 NaN 

4 NaN 

5 3.0 
dtype: float64 

To use a different replacement for each value, pass a list of substitutes: 

In [64]: data.reptace( [ -999, -1000], [np.nan, 0]) 

0ut[64] : 

0 1.0 

1 NaN 

2 2.0 

3 NaN 

4 0.0 

5 3.0 
dtype: float64 

The argument passed can also be a diet: 

In [65]: data. reptace({-999: np.nan, -1000: 0}) 

0ut[65] : 

0 1.0 

1 NaN 

2 2.0 

3 NaN 

4 0.0 

5 3.0 
dtype: float64 



The data, replace method is distinct from data.str. replace, 
which performs string substitution element-wise. We look at these 
string methods on Series later in the chapter. 


Renaming Axis Indexes 

Like values in a Series, axis labels can be similarly transformed by a function or map¬ 
ping of some form to produce new, differently labeled objects. You can also modify 
the axes in-place without creating a new data structure. Here’s a simple example: 
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In [66]: data = pd.DataFramefnp.arange(12) . reshape((3, 4)), 

index=[ 'Ohio' , 'Colorado', 'New York'], 
colunns=[ 'one' , 'two', 'three', 'four']) 

Like a Series, the axis indexes have a nap method: 

In [67]: transform = lambda x: x[:4]. upper( ) 

In [68]: data.index.map(transform) 

0ut[68] : Index( [ 'OHIO' , 'COLO', 'NEW '], dtype= 1 object 1 ) 

You can assign to index, modifying the DataFrame in-place: 

In [69]: data.index = data.index.map(transform) 


In [70]: data 
Out[70] : 

one two three four 

OHIO 0 1 2 3 

COLO 4 5 6 7 

NEW 8 9 10 11 

If you want to create a transformed version of a dataset without modifying the origi¬ 
nal, a useful method is rename: 

In [71]: data.rename(index=str.title, columns=str . upper) 

0ut[71] : 

ONE TWO THREE FOUR 

Ohio 0 1 2 3 

Colo 4 5 6 7 

New 89 10 11 

Notably, rename can be used in conjunction with a dict-like object providing new val¬ 
ues for a subset of the axis labels: 


In [72] 


0ut[72] 

INDIANA 

COLO 

NEW 


data.rename(index={ 'OHIO' : 1 INDIANA' }, 

columns={ 'three' : 'peekaboo'}) 

one two peekaboo four 
0 1 2 3 

4 5 6 7 

89 10 11 


rename saves you from the chore of copying the DataFrame manually and assigning 
to its index and columns attributes. Should you wish to modify a dataset in-place, 
pass inplace=True: 

In [73]: data.rename(index={ 'OHIO' : 'INDIANA'}, inplace=True) 


In [74]: data 
0ut[74] : 

one two three four 
INDIANA 01 23 
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COLO 4 5 6 7 

NEW 89 10 11 

Discretization and Binning 

Continuous data is often discretized or otherwise separated into “bins” for analysis. 
Suppose you have data about a group of people in a study, and you want to group 
them into discrete age buckets: 

In [75]: ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32] 

Let’s divide these into bins of 18 to 25, 26 to 35, 36 to 60, and finally 61 and older. To 
do so, you have to use cut, a function in pandas: 

In [76]: bins = [18, 25, 35, 60, 100] 

In [77]: cats = pd.cut(ages, bins) 

In [78]: cats 
0ut[78] : 

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 
60], (35, 60], (25, 35]] 

Length: 12 

Categories (4, interval[int64] ): [(18, 25] < (25, 35] < (35, 60] < (60, 100]] 

The object pandas returns is a special Categorical object. The output you see 
describes the bins computed by pandas.cut. You can treat it like an array of strings 
indicating the bin name; internally it contains a categories array specifying the dis¬ 
tinct category names along with a labeling for the ages data in the codes attribute: 

In [79]: cats.codes 

0ut[79]: array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8) 

In [80]: cats.categories 
Out[80] : 

Intervailndex([(18, 25], (25, 35], (35, 60], (60, 100]] 
closed= 1 right' , 
dtype='interval[int64] 1 ) 

In [81]: pd.vaiue_counts(cats) 

0ut[81] : 

(18, 25] 5 

(35, 60] 3 

(25, 35] 3 

(60, 100] 1 

dtype: int64 

Note that pd. value_counts(cats) are the bin counts for the result of pandas, cut. 

Consistent with mathematical notation for intervals, a parenthesis means that the side 
is open, while the square bracket means it is closed (inclusive). You can change which 
side is closed by passing right=False: 
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In [82]: pd.cut(ages, [18, 26, 36, 61, 100], rlght=False) 

0ut[82] : 

[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), .... [26, 36), [61, 100), [36, 
61), [36, 61), [26, 36)] 

Length: 12 

Categories (4, interval[int64] ): [[18, 26) < [26, 36) < [36, 61) < [61, 100)] 

You can also pass your own bin names by passing a list or array to the labels option: 

In [83]: group_names = ['Youth 1 , ' YoungAdult' , ' MiddleAged' , 'Senior'] 

In [84]: pd.cut(ages, bins, labels=group_names) 

0ut[84] : 

[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, Mid 
dleAged, YoungAdult] 

Length: 12 

Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior] 

If you pass an integer number of bins to cut instead of explicit bin edges, it will com¬ 
pute equal-length bins based on the minimum and maximum values in the data. 
Consider the case of some uniformly distributed data chopped into fourths: 

In [85]: data = np.random.rand(20) 

In [86]: pd.cut(data, 4, precision=2) 

0ut[86] : 

[(0.34, 0.55], (0.34, 0.55], (0.76, 0.97], (0.76, 0.97], (0.34, 0.55], ..., (0.34 
, 0.55], (0.34, 0.55], (0.55, 0.76], (0.34, 0.55], (0.12, 0.34]] 

Length: 20 

Categories (4, interval[float64] ): [(0.12, 0.34] < (0.34, 0.55] < (0.55, 0.76] < 
(0.76, 0.97]] 

The precision=2 option limits the decimal precision to two digits. 

A closely related function, qcut, bins the data based on sample quantiles. Depending 
on the distribution of the data, using cut will not usually result in each bin having the 
same number of data points. Since qcut uses sample quantiles instead, by definition 
you will obtain roughly equal-size bins: 

In [87]: data = np.random. randn(1000) # Normally distributed 

In [88]: cats = pd.qcut(data, 4) # Cut into quartiles 

In [89]: cats 
0ut[89] : 

[(-0.0265, 0.62], (0.62, 3.928], (-0.68, -0.0265], (0.62, 3.928], (-0.0265, 0.62] 
, ..., (-0.68, -0.0265], (-0.68, -0.0265], (-2.95, -0.68], (0.62, 3.928], (-0.68, 
-0.0265]] 

Length: 1000 

Categories (4, interval[float64] ): [(-2.95, -0.68] < (-0.68, -0.0265] < (-0.0265, 
0.62] < 

(0.62, 3.928]] 
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In [90]: pd.value_counts(cats) 

Out[90] : 

(0.62, 3.928] 250 

(-0.0265, 0.62] 250 

(-0.68, -0.0265] 250 

(-2.95, -0.68] 250 

dtype: tnt64 

Similar to cut you can pass your own quantiles (numbers between 0 and 1, inclusive): 

In [91]: pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.]) 

0ut[91] : 

[(-0.0265, 1.286], (-0.0265, 1.286], (-1.187, -0.0265], (-0.0265, 1.286], (-0.026 
5, 1.286], ..., (-1.187, -0.0265], (-1.187, -0.0265], (-2.95, -1.187], (-0.0265, 
1.286], (-1.187, -0.0265]] 

Length: 1000 

Categories (4, interval[float64] ): [(-2.95, -1.187] < (-1.187, -0.0265] < (-0.026 
5, 1.286] < 

(1.286, 3.928]] 

We’ll return to cut and qcut later in the chapter during our discussion of aggregation 
and group operations, as these discretization functions are especially useful for quan¬ 
tile and group analysis. 

Detecting and Filtering Outliers 

Filtering or transforming outliers is largely a matter of applying array operations. 
Consider a DataFrame with some normally distributed data: 

In [92]: data = pd.DataFrame(np.random. randn(1000, 4)) 


In [93]: data.describeQ 
0ut[93] : 



0 

1 

2 

3 

count 

1000.000000 

1000.000000 

1000.000000 

1000.000000 

mean 

0.049091 

0.026112 

-0.002544 

-0.051827 

std 

0.996947 

1.007458 

0.995232 

0.998311 

min 

-3.645860 

-3.184377 

-3.745356 

-3.428254 

25% 

-0.599807 

-0.612162 

-0.687373 

-0.747478 

50% 

0.047101 

-0.013609 

-0.022158 

-0.088274 

75% 

0.756646 

0.695298 

0.699046 

0.623331 

max 

2.653656 

3.525865 

2.735527 

3.366626 


Suppose you wanted to find values in one of the columns exceeding 3 in absolute 
value: 

In [94]: cot = data[2] 

In [95]: col[np.abs(col) > 3] 

0ut[95] : 

41 -3.399312 

136 -3.745356 

Name: 2, dtype: ftoat64 
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To select all rows having a value exceeding 3 or -3, you can use the any method on a 
boolean DataFrame: 


In [96]: data[(np.abs(data) > 3).any(l)] 
0ut[96] : 



0 

1 

2 

3 

41 

0.457246 

-0.025907 

-3.399312 

-0.974657 

60 

1.951312 

3.260383 

0.963301 

1.201206 

136 

0.508391 

-0.196713 

-3.745356 

-1.520113 

235 

-0.242459 

-3.056990 

1.918403 

-0.578828 

258 

0.682841 

0.326045 

0.425384 

-3.428254 

322 

1.179227 

-3.184377 

1.369891 

-1.074833 

544 

-3.548824 

1.553205 

-2.186301 

1.277104 

635 

-0.578093 

0.193299 

1.397822 

3.366626 

782 

-0.207434 

3.525865 

0.283070 

0.544635 

803 

-3.645860 

0.255475 

-0.549574 

-1.907459 


Values can be set based on these criteria. Here is code to cap values outside the inter¬ 
val -3 to 3: 

In [97]: data[np.abs(data) > 3] = np. stgn(data) * 3 


In [98] 

: data.describe() 


0ut[98] 

0 

1 


count 

1000.000000 

1000.000000 

1000 

mean 

0.050286 

0.025567 

-0 

std 

0.992920 

1.004214 

0 

min 

-3.000000 

-3.000000 

-3 

25% 

-0.599807 

-0.612162 

-0 

50% 

0.047101 

-0.013609 

-0 

75% 

0.756646 

0.695298 

0 

max 

2.653656 

3.000000 

2 


The statement np.sign(data) produces 
in data are positive or negative: 

In [99]: np.sign(data).head( ) 

0ut[99] : 


0 

1 

2 3 

0 -1.0 

1.0 

-1.0 1.0 

1 1.0 

-1.0 

1.0 -1.0 

2 1.0 

1.0 

1.0 -1.0 

3 -1.0 

-1.0 

1.0 -1.0 

4 -1.0 

1.0 

-1.0 -1.0 


2 3 
000000 1000.000000 
001399 -0.051765 
991414 0.995761 
000000 -3.000000 
687373 -0.747478 
022158 -0.088274 
699046 0.623331 
735527 3.000000 

and -1 values based on whether the values 


Permutation and Random Sampling 

Permuting (randomly reordering) a Series or the rows in a DataFrame is easy to do 
using the numpy. random, permutation function. Calling permutation with the length 
of the axis you want to permute produces an array of integers indicating the new 
ordering: 
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In [100]: df = pd.DataFrame(np.arange(5 * 4). reshape((5, 4))) 


In [101]: sampler = np.random.permutatlon(5) 


In [102]: sampler 

Out[102]: array([3, 1, 4, 2, 0]) 

That array can then be used in iloc-based indexing or the equivalent take function: 


In [103]: df 
Out[103]: 


0 12 3 

0 0 12 3 

1 4 5 6 7 

2 8 9 10 11 

3 12 13 14 15 

4 16 17 18 19 

In [104]: df.take(sampler) 
Out[104] : 



0 

1 

2 

3 

3 

12 

13 

14 

15 

1 

4 

5 

6 

7 

4 

16 

17 

18 

19 

2 

8 

9 

10 

11 

0 

0 

1 

2 

3 


To select a random subset without replacement, you can use the sample method on 
Series and DataFrame: 


In [105]: df,sample(n=3) 

Out[105] : 

0 12 3 

3 12 13 14 15 

4 16 17 18 19 

2 8 9 10 11 

To generate a sample with replacement (to allow repeat choices), pass replace=True 
to sample: 

In [106]: choices = pd.Series( [5, 1, -1, 6, 4]) 

In [107]: draws = choices. sample(n=10, replace=True) 

In [108]: draws 
Out[108] : 

4 4 

1 7 

4 4 

2 -1 

0 5 

3 6 
1 7 
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4 4 

0 5 

4 4 

dtype: lnt64 

Computing Indicator/Dummy Variables 

Another type of transformation for statistical modeling or machine learning applica¬ 
tions is converting a categorical variable into a “dummy” or “indicator” matrix. If a 
column in a DataFrame has k distinct values, you would derive a matrix or Data- 
Frame with k columns containing all Is and Os. pandas has a get_dumrnies function 
for doing this, though devising one yourself is not difficult. Let’s return to an earlier 
example DataFrame: 

In [109]: df = pd .DataFrame({ 'key' : ['b', 'b', 'a', 'c', 'a', 'b' ], 

.: 'datal 1 : range(6)}) 

In [110]: pd.get_dummies(df [ 1 key’ ]) 

Out[110]: 

a b c 
0 0 10 
10 10 
2 10 0 

3 0 0 1 

4 10 0 

5 0 10 

In some cases, you may want to add a prefix to the columns in the indicator Data¬ 
Frame, which can then be merged with the other data. get_dunmies has a prefix argu¬ 
ment for doing this: 

In [111]: dummies = pd.get_dummles(df [' key 1 ], prefix=' key' ) 

In [112]: df_wtth_dummy = df [[' datal 1 ]]. joln(dummles) 


In [113]: df_wtth_dummy 
0ut[113] : 

datal key_a key_b key_c 


0 0 

1 1 

2 2 

3 3 

4 4 

5 5 


0 10 
0 10 
10 0 
0 0 1 
10 0 
0 10 


If a row in a DataFrame belongs to multiple categories, things are a bit more compli¬ 
cated. Let’s look at the MovieLens 1M dataset, which is investigated in more detail in 
Chapter 14: 
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In [114]: mnames = ['movie_id', 'title', 'genres'] 

In [115]: movies = pd.read_table(' datasets/movielens/movies.dat' , sep='::', 
.: header=None, names=mnames) 


In [116]: movies[:16] 
0ut[116]: 



movie_id 


title 

genres 

0 

1 

Toy Story 

(1995) 

Animation|Children' s|Comedy 

1 

2 

Jumanji 

(1995) 

Adventure | Children' s|Fantasy 

2 

3 

Grumpier Old Men 

(1995) 

Comedy | Romance 

3 

4 

Waiting to Exhale 

(1995) 

Comedy | Drama 

4 

5 

Father of the Bride Part II 

(1995) 

Comedy 

5 

6 

Heat 

(1995) 

Action|Crime|Thriller 

6 

7 

Sabrina 

(1995) 

Comedy | Romance 

7 

8 

Tom and Huck 

(1995) 

Adventure|Children 's 

8 

9 

Sudden Death 

(1995) 

Action 

9 

10 

GoldenEye 

(1995) 

Action | Adventure|Thriller 


Adding indicator variables for each genre requires a little bit of wrangling. First, we 
extract the list of unique genres in the dataset: 

In [117]: all_genres = [] 

In [118]: for x in movies.genres : 

.: all_genres.extend(x.split(' | ')) 

In [119]: genres = pd . unique(all_genres) 

Now we have: 

In [120]: genres 
Out[120] : 

array( [ 'Animation' , "Children's", 'Comedy', 'Adventure', 'Fantasy 1 , 

'Romance 1 , 'Drama', 'Action', 'Crime', 'Thriller', 'Horror', 

'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir', 
'Western'], dtype=object) 

One way to construct the indicator DataFrame is to start with a DataFrame of all 
zeros: 

In [121]: zero_matrix = np.zeros((len(movies), len(genres))) 

In [122]: dummies = pd.DataFrame(zero_matrix, columns=genres) 

Now, iterate through each movie and set entries in each row of dummies to 1. To do 
this, we use the dummies. columns to compute the column indices for each genre: 
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In [123]: gen = movies. genres[0] 

In [124]: gen.split(' | ' ) 

0ut[124]: ['Animation', "Children's", 'Comedy'] 


In [125]: dummies.columns.get_indexer(gen.split( '| ' )) 
0ut[125]: array([0, 1, 2]) 

Then, we can use . iloc to set values based on these indices: 


In [126] 


for i, gen in enumerate(movies .genres) : 

indices = dummies.columns.get_indexer(gen.split( '|' )) 
dummies. iloc[i, indices] = 1 


Then, as before, you can combine this with movies: 

In [127]: movies_windic = movies.join(dummies.add_prefix( 'Genre_' )) 


In [128]: movies_windic.iloc[0] 
0ut[128] : 


movie_id 

title 

genres 

Genre_Animation 

Genre_Children 's 

Genre_Comedy 

Genre_Adventure 

Genre_Fantasy 

Genre_Romance 

Genre Drama 


1 

Toy Story (1995) 
Animation | Children' s|Comedy 

1 

1 

1 

0 

0 

0 

0 


Genre_Crime 0 
Genre_Thriller 0 
GenreJHorror 0 
Genre_Sci-Fi 0 
Genre_Documentary 0 
Genre_War 0 
Genre_Musical 0 
Genre_Mystery 0 
Genre_Film-Noir 0 
Genre_Western 0 


Name: 0, Length: 21, dtype: object 



For much larger data, this method of constructing indicator vari¬ 
ables with multiple membership is not especially speedy. It would 
be better to write a lower-level function that writes directly to a 
NumPy array, and then wrap the result in a DataFrame. 


A useful recipe for statistical applications is to combine get_dummies with a discreti¬ 
zation function like cut: 
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In [129]: np.random. seed(12345) 


In [130]: values = np.random. rand(10) 

In [131]: values 
0ut[131] : 

array( [ 0.9296, 0.3164, 0.1839, 0.2046, 0.5677, 0.5955, 0.9645, 

0.6532, 0.7489, 0.6536]) 

In [132]: bins = [0, 0.2, 0.4, 0.6, 0.8, 1] 


In [133]: pd.get_dummies(pd.cut(values, bins)) 
0ut[133]: 



(0.0, 0.2] 

(0.2, 0.4] 

(0.4, 0.6] 

(0.6, 0.8] 

(0.8, 1.0] 

0 

0 

0 

0 

0 

1 

1 

0 

1 

0 

0 

0 

2 

1 

0 

0 

0 

0 

3 

0 

1 

0 

0 

0 

4 

0 

0 

1 

0 

0 

5 

0 

0 

1 

0 

0 

6 

0 

0 

0 

0 

1 

7 

0 

0 

0 

1 

0 

8 

0 

0 

0 

1 

0 

9 

0 

0 

0 

1 

0 


We set the random seed with numpy. random. seed to make the example deterministic. 
We will look again at pandas. get_dummies later in the book. 

7.3 String Manipulation 

Python has long been a popular raw data manipulation language in part due to its 
ease of use for string and text processing. Most text operations are made simple with 
the string object’s built-in methods. For more complex pattern matching and text 
manipulations, regular expressions may be needed, pandas adds to the mix by ena¬ 
bling you to apply string and regular expressions concisely on whole arrays of data, 
additionally handling the annoyance of missing data. 

String Object Methods 

In many string munging and scripting applications, built-in string methods are suffi¬ 
cient. As an example, a comma-separated string can be broken into pieces with 
split: 

In [134]: val = 'a,b, guido' 

In [135]: val.spllt(' ,' ) 

0ut[135]: ['a', 'b' , 1 guido'] 

split is often combined with strip to trim whitespace (including line breaks): 
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In [136]: pieces = [x.stripQ for x In val.spllt( ', 1 )] 


In [137]: pieces 

0ut[137]: ['a', 'b' , 'guido'] 

These substrings could be concatenated together with a two-colon delimiter using 
addition: 

In [138]: first, second, third = pieces 

In [139]: first + + second + + third 

0ut[139]: 'a::b::guido' 

But this isn’t a practical generic method. A faster and more Pythonic way is to pass a 
list or tuple to the join method on the string '::': 

In [140]: '::'. join(pieces) 

Out[140]: 'a::b::guido' 

Other methods are concerned with locating substrings. Using Python’s in keyword is 
the best way to detect a substring, though index and find can also be used: 

In [141]: 'guido' in val 
0ut[141]: True 

In [142]: val.index(' ,' ) 

0ut[142]: 1 

In [143] : val.find( ' : ' ) 

0ut[143]: -1 

Note the difference between find and index is that index raises an exception if the 
string isn’t found (versus returning -1): 

In [144]: val.index(' :' ) 


ValueError Traceback (most recent call last) 

<ipython-input-144-280f8b2856ce> in <nodule>() 

-> 1 val.index(' :' ) 

ValueError: substring not found 

Relatedly, count returns the number of occurrences of a particular substring: 

In [145]: val.count(' ,' ) 

0ut[145]: 2 

replace will substitute occurrences of one pattern for another. It is commonly used 
to delete patterns, too, by passing an empty string: 

In [146]: val.replace! ',' , '::') 

0ut[146]: 'a::b:: guido' 

In [147]: val.replace( ',' , '') 

0ut[147]: 'ab guido' 
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See Table 7-3 for a listing of some of Pythons string methods. 

Regular expressions can also be used with many of these operations, as you’ll see. 

Table 7-3. Python built-in string methods 


Argument Description 


count 

endswith 

startswtth 

join 

index 

find 

rflnd 

replace 

strip, 

rstrip, 

Istrip 

split 

lower 

upper 

casefold 

ljust, 

rjust 


Return the number of non-overlapping occurrences of substring in the string. 

Returns T rue if string ends with suffix. 

Returns T rue if string starts with prefix. 

Use string as delimiter for concatenating a sequence of other strings. 

Return position of first character in substring if found in the string; raises ValueEr ror if not found. 
Return position of first character of first occurrence of substring in the string; like index, but returns -1 
if not found. 

Return position of first character of tasf occurrence of substring in the string; returns -1 if not found. 
Replace occurrences of string with another string. 

Trim whitespace, including newlines; equivalent to x.strip() (and rstrip, Istrip, respectively) 
for each element. 

Break string into list of substrings using passed delimiter. 

Convert alphabet characters to lowercase. 

Convert alphabet characters to uppercase. 

Convert characters to lowercase, and convert any region-specific variable character combinations to a 
common comparable form. 

Left justify or right justify, respectively; pad opposite side of string with spaces (or some other fill 
character) to return a string with a minimum width. 


Regular Expressions 

Regular expressions provide a flexible way to search or match (often more complex) 
string patterns in text. A single expression, commonly called a regex, is a string 
formed according to the regular expression language. Pythons built-in re module is 
responsible for applying regular expressions to strings; I’ll give a number of examples 
of its use here. 



The art of writing regular expressions could be a chapter of its own 
and thus is outside the book’s scope. There are many excellent tuto¬ 
rials and references available on the internet and in other books. 


The re module functions fall into three categories: pattern matching, substitution, 
and splitting. Naturally these are all related; a regex describes a pattern to locate in the 
text, which can then be used for many purposes. Let’s look at a simple example: 


7.3 String Manipulation | 213 







suppose we wanted to split a string with a variable number of whitespace characters 
(tabs, spaces, and newlines). The regex describing one or more whitespace characters 
is \s+: 

In [148]: import 

In [149]: text = "foo bar\t baz \tqux" 

In [150]: re.split(' \s+' , text) 

Out[150]: ['foo', 'bar', 'baz', 'qux'] 

When you call re. split(' \s+', text), the regular expression is first compiled, and 
then its split method is called on the passed text. You can compile the regex yourself 
with re. compile, forming a reusable regex object: 

In [151]: regex = re,compile( '\s+' ) 

In [152]: regex.split(text) 

0ut[152]: ['foo 1 , 'bar', 'baz', 'qux'] 

If, instead, you wanted to get a list of all patterns matching the regex, you can use the 
findall method: 

In [153]: regex.findall(text) 

0ut[153]: [' '\t ' \t'] 



To avoid unwanted escaping with \ in a regular expression, use raw 
string literals like r 1 C: \ x 1 instead of the equivalent ' C: \ \x'. 


Creating a regex object with re.compile is highly recommended if you intend to 
apply the same expression to many strings; doing so will save CPU cycles. 

match and search are closely related to findall. While findall returns all matches 
in a string, search returns only the first match. More rigidly, match only matches at 
the beginning of the string. As a less trivial example, let’s consider a block of text and 
a regular expression capable of identifying most email addresses: 

text = '"'"Dave dave@googte.com 
Steve steve@gmall.com 
Rob rob@gmail.com 
Ryan ryan@yahoo.com 

pattern = r' [A-Z0-9]+@[A-Z0-9.-]+\.[A-Z]{2,4]' 

# re.ICNORECASE nakes the regex case-insensitive 
regex = re.compile(pattern, flags=re.IGNORECASE) 

Using findall on the text produces a list of the email addresses: 
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In [155]: regex.findall(text) 

Out[ 155] : 

[ 1 dave@googte.con' , 

'steve@gmail.com' , 

'rob@gnail.con' , 

'ryan@yahoo.con' ] 

search returns a special match object for the first email address in the text. For the 
preceding regex, the match object can only tell us the start and end position of the 
pattern in the string: 

In [156]: n = regex.search(text) 

In [157]: n 

Out[157]: <_sre.SRE_Match object; span=(5, 20), natch='dave@google.con'> 

In [158]: text[m. startQ:m.end( )] 

Out[158]: 'dave@googte.con' 

regex .natch returns None, as it only will match if the pattern occurs at the start of the 
string: 

In [159]: print(regex.natch(text)) 

None 

Relatedly, sub will return a new string with occurrences of the pattern replaced by the 
a new string: 

In [160]: print( regex.sub( 'REDACTED 1 , text)) 

Dave REDACTED 
Steve REDACTED 
Rob REDACTED 
Ryan REDACTED 

Suppose you wanted to find email addresses and simultaneously segment each 
address into its three components: username, domain name, and domain suffix. To 
do this, put parentheses around the parts of the pattern to segment: 

In [161]: pattern = r '([A-Z0-9]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})' 

In [162]: regex = re.compile(pattern, flags=re.IGNORECASE) 

A match object produced by this modified regex returns a tuple of the pattern com¬ 
ponents with its groups method: 

In [163]: m = regex.match(' wesm@bright.net 1 ) 

In [164]: m.groups() 

0ut[164]: (’wesm 1 , 'bright 1 , 'net') 

findall returns a list of tuples when the pattern has groups: 

In [165]: regex.findall(text) 

Out[165]: 
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[('dave', 'google', 'con'), 

('steve', 'gnail', 'con'), 

( 'rob' , 'gnail’ , 1 con 1 ), 

('ryan', 'yahoo', 'con')] 

sub also has access to groups in each match using special symbols like \1 and \2. The 
symbol \1 corresponds to the first matched group, \2 corresponds to the second, and 
so forth: 

In [166]: print( regex. sub(r 'Username: \1, Domain: \2, Suffix: \3', text)) 

Dave Usernane: dave, Donain: google. Suffix: con 
Steve Usernane: steve, Donain: gnail. Suffix: con 
Rob Usernane: rob, Donain: gnail. Suffix: con 
Ryan Usernane: ryan, Donain: yahoo. Suffix: con 

There is much more to regular expressions in Python, most of which is outside the 
book’s scope. Table 7-4 provides a brief summary. 


Table 7-4. Regular expression methods 


Argument Description 


findall 

finditer 

natch 

search 

split 
sub, subn 


Return all non-overlapping matching patterns in a string as a list 
Like findall, but returns an iterator 

Match pattern at start of string and optionally segment pattern components into groups; if the pattern 
matches, returns a match object, and otherwise None 

Scan string for match to pattern; returning a match object if so; unlike match, the match can be anywhere in 
the string as opposed to only at the beginning 
Break string into pieces at each occurrence of pattern 

Replace all (sub) or first n occurrences (subn) of pattern in string with replacement expression; use symbols 
\1, \2, ... to refer to match group elements in the replacement string 


Vectorized String Functions in pandas 

Cleaning up a messy dataset for analysis often requires a lot of string munging and 
regularization. To complicate matters, a column containing strings will sometimes 
have missing data: 

In [167]: data = {'Dave': 'dave@google.com 1 , 'Steve': 'steve@gmail.com', 

. : 'Rob': 'rob@gmail.com', 'Wes': np.nan} 

In [168]: data = pd.Series(data) 

In [169]: data 
0ut[169]: 

Dave dave@google.com 

Rob rob@gmail.com 

Steve steve@gmail.com 

Wes NaN 

dtype: object 
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In [170]: data.isnull() 

Out[170] : 

Dave False 

Rob False 

Steve False 

Wes True 

dtype: bool 

You can apply string and regular expression methods can be applied (passing a 
lambda or other function) to each value using data.map, but it will fail on the NA 
(null) values. To cope with this, Series has array-oriented methods for string opera¬ 
tions that skip NA values. These are accessed through Series’s str attribute; for exam¬ 
ple, we could check whether each email address has 'gmail' in it with str.contains: 

In [171]: data.str.contains( 'gmail' ) 

0ut[171] : 

Dave False 

Rob True 

Steve True 

Wes NaN 

dtype: object 

Regular expressions can be used, too, along with any re options like IGNORECASE: 

In [172]: pattern 

Out [172]: 1 ([A-Z0-9 ._%+-]+)(§( [A-Z0-9. - ]+)\\.([A-Z]{2,4})' 

In [173]: data.str.findall(pattern, flags=re.IGNORECASE) 

0ut[173] : 

Dave [(dave, google, con)] 

Rob [(rob, gmail, con)] 

Steve [(steve, gmail, con)] 

Wes NaN 

dtype: object 

There are a couple of ways to do vectorized element retrieval. Either use str. get or 
index into the str attribute: 

In [174]: natches = data.str.match(pattern, flags=re.IGNORECASE) 

In [175]: natches 
0ut[175] : 

Dave True 

Rob True 

Steve True 

Wes NaN 

dtype: object 

To access elements in the embedded lists, we can pass an index to either of these 
functions: 

In [176]: natches.str. get(l) 

0ut[176] : 
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Dave NaN 

Rob NaN 

Steve NaN 

Wes NaN 

dtype: float64 

In [177]: matches. str[0] 

0ut[177] : 

Dave NaN 

Rob NaN 

Steve NaN 

Wes NaN 

dtype: float64 

You can similarly slice strings using this syntax: 

In [178]: data.str[:5] 

0ut[178] : 

Dave dave@ 

Rob rob@g 

Steve Steve 

Wes NaN 

dtype: object 

See Table 7-5 for more pandas string methods. 


Table 7-5. Partial listing of vectorized string methods 


Method Description 


cat 

contains 

count 

extract 

endswith 

startswith 

findall 

get 

isalnum 

tsalpha 

isdecimal 

isdiglt 

islower 

isnumeric 

isupper 

join 

ten 

tower, upper 


Concatenate strings element-wise with optional delimiter 
Return boolean array if each string contains pattern/regex 
Count occurrences of pattern 

Use a regular expression with groups to extract one or more strings from a Series of strings; the result 
will be a DataFrame with one column per group 

Equivalent to x. endswith (pattern) for each element 
Equivalent tox.startswith(pattern) for each element 
Compute list of all occurrences of pattern/regex for each string 
Index into each element (retrieve /-th element) 

Equivalent to built-in str. alnum 

Equivalent to built-in str. isalpha 

Equivalent to built-in str. isdecimal 

Equivalent to built-in str. isdigit 

Equivalent to built-in str. islower 

Equivalent to built-in str. isnumeric 

Equivalent to built-in str. isupper 

Join strings in each element of the Series with passed separator 

Compute length of each string 

Convert cases; equivalent to x. lower() or x. upper() for each element 
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Method 


Description 


natch 

pad 

center 

repeat 

replace 

slice 

split 

strip 

rstrip 

Istrip 


Use re. natch with the passed regular expression on each element, returning matched groups as list 
Add whitespace to left, right, or both sides of strings 
Equivalenttopad(side='both') 

Duplicate values (e.g.,s.str.repeat(3) is equivalent to x * 3 for each string) 

Replace occurrences of pattern/regex with some other string 

Slice each string in the Series 

Split strings on delimiter or regular expression 

Trim whitespace from both sides, including newlines 

Trim whitespace on right side 

Trim whitespace on left side 


7.4 Conclusion 

Effective data preparation can significantly improve productive by enabling you to 
spend more time analyzing data and less time getting it ready for analysis. We have 
explored a number of tools in this chapter, but the coverage here is by no means com¬ 
prehensive. In the next chapter, we will explore pandas’s joining and grouping func¬ 
tionality. 
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CHAPTER 8 


Data Wrangling: Join, Combine, 

and Reshape 


In many applications, data may be spread across a number of files or databases or be 
arranged in a form that is not easy to analyze. This chapter focuses on tools to help 
combine, join, and rearrange data. 

First, I introduce the concept of hierarchical indexing in pandas, which is used exten¬ 
sively in some of these operations. I then dig into the particular data manipulations. 
You can see various applied usages of these tools in Chapter 14. 

8.1 Hierarchical Indexing 

Hierarchical indexing is an important feature of pandas that enables you to have mul¬ 
tiple (two or more) index levels on an axis. Somewhat abstractly, it provides a way for 
you to work with higher dimensional data in a lower dimensional form. Let’s start 
with a simple example; create a Series with a list of lists (or arrays) as the index: 

In [9]: data = pd.Series(np.random.randn(9) , 

...: index=[ [ 1 a 1 , 'a', 'a 1 , 'b\ ' b' , 'c', 'c\ 'd 1 , ' d' ] , 

...: [1, 2, 3, 1, 3, 1, 2, 2, 3]]) 


In [10]: data 
Out[10] : 

a 1 -0.204708 

2 0.478943 

3 -0.519439 

b 1 -0.555730 

3 1.965781 

c 1 1.393406 

2 0.092908 

d 2 0.281746 
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3 0.769023 

dtype: float64 

What you’re seeing is a prettified view of a Series with a Multiindex as its index. The 
“gaps” in the index display mean “use the label directly above”: 

In [11]: data.index 
Out[ll] : 

Multilndex(levels=[[' a' , ' b' , 'c', 'd' ], [1, 2, 3]], 

labels=[[0, 0, 0, 1 , 1 , 2 , 2 , 3, 3], [0, 1 , 2 , 0, 2 , 0, 1 , 1 , 2]]) 

With a hierarchically indexed object, so-called partial indexing is possible, enabling 
you to concisely select subsets of the data: 

In [12]: data['b'] 

0ut[12] : 

1 -0.555730 

3 1.965781 

dtype: float64 

In [13]: data[ 'b' : 'c' ] 

0ut[13] : 

b 1 -0.555730 

3 1.965781 

c 1 1.393406 

2 0.092908 
dtype: float64 

In [14]: data,loc[ [ 1 b 1 , 'd' ]] 

0ut[14] : 

b 1 -0.555730 

3 1.965781 

d 2 0.281746 

3 0.769023 

dtype: float64 

Selection is even possible from an “inner” level: 

In [15]: data.loc[:, 2] 

0ut[15] : 
a 0.478943 
c 0.092908 
d 0.281746 
dtype: float64 

Hierarchical indexing plays an important role in reshaping data and group-based 
operations like forming a pivot table. For example, you could rearrange the data into 
a DataFrame using its unstack method: 

In [16]: data.unstack( ) 

0ut[16] : 

12 3 

a -0.204708 0.478943 -0.519439 

b -0.555730 NaN 1.965781 
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c 1.393406 0.092908 NaN 

d NaN 0.281746 0.769023 

The inverse operation of unstack is stack: 

In [17]: data.unstack( ). stackQ 
0ut[17] : 

a 1 -0.204708 

2 0.478943 

3 -0.519439 

b 1 -0.555730 

3 1.965781 

c 1 1.393406 

2 0.092908 

d 2 0.281746 

3 0.769023 
dtype: float64 

stack and unstack will be explored in more detail later in this chapter. 

With a DataFrame, either axis can have a hierarchical index: 

In [18]: frame = pd.DataFrame(np. arange(12).reshape((4, 3)), 

-: index=[ [ 1 a', 'a', 'b\ ' b' ], [1, 2, 1 , 2]], 

....: columns=[ [ 1 Ohio' , 'Ohio', 'Colorado'], 

....: ['Green', 'Red', 'Green']]) 


In [19]: frame 
0ut[19] : 

Ohio Colorado 
Green Red Green 
a 1 0 1 2 

2 3 4 5 

b 1 6 7 8 

2 9 10 11 

The hierarchical levels can have names (as strings or any Python objects). If so, these 
will show up in the console output: 

In [20]: frame.index.names = ['keyl', 'key2'] 

In [21]: frame.columns.names = ['state', 'color'] 

In [22]: frame 
0ut[22] : 

state Ohio Colorado 

color Green Red Green 

keyl key2 

a 1 0 1 2 

2 3 4 5 

b 1 6 7 8 

2 9 10 11 
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Be careful to distinguish the index names 'state' and 'color' 
from the row labels. 


With partial column indexing you can similarly select groups of columns: 

In [23]: frame[ 'Ohio' ] 

0ut[23] : 

color Green Red 

keyl key2 

a 1 0 1 

2 3 4 

b 1 6 7 

2 9 10 

A Multiindex can be created by itself and then reused; the columns in the preceding 
DataFrame with level names could be created like this: 

Multiindex.from_arrays( [[ ’Ohio 1 , 'Ohio 1 , ’Colorado 1 ], ['Green', 'Red', 'Green']], 
names: [' state 1 , 'color']) 

Reordering and Sorting Levels 

At times you will need to rearrange the order of the levels on an axis or sort the data 
by the values in one specific level. The swaplevel takes two level numbers or names 
and returns a new object with the levels interchanged (but the data is otherwise 
unaltered): 

In [24]: frame.swaplevel(' keyl' , 'key2') 


0ut[24] : 
state 

Ohio 


Colorado 

color 

Green 

Red 

Green 

key2 keyl 
1 a 

0 

1 

2 

2 a 

3 

4 

5 

1 b 

6 

7 

8 

2 b 

9 

10 

11 


sort_index, on the other hand, sorts the data using only the values in a single level. 
When swapping levels, it’s not uncommon to also use sort_index so that the result is 
lexicographically sorted by the indicated level: 

In [25]: frame.sort_index(level=l) 

0ut[25] : 

state Ohio Colorado 

color Green Red Green 

keyl key2 

a 1 0 1 2 

b 1 6 7 8 

a 2 3 4 5 
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b 


2 


9 10 


11 


In [26]: frame.swaplevel(0, 1). sort_index(level=0) 
0ut[26] : 

state Ohio Colorado 
color Green Red Green 

key2 keyl 

la 0 1 2 

b 6 7 8 

2 a 3 4 5 

b 9 10 11 



Data selection performance is much better on hierarchically 
indexed objects if the index is lexicographically sorted starting with 
the outermost level—that is, the result of calling 
sort_index(level=0) or sortindexQ. 


Summary Statistics by Level 

Many descriptive and summary statistics on DataFrame and Series have a level 
option in which you can specify the level you want to aggregate by on a particular 
axis. Consider the above DataFrame; we can aggregate by level on either the rows or 
columns like so: 

In [27]: frame.sum(level= 1 key2' ) 

0ut[27] : 

state Ohio Colorado 

color Green Red Green 

key2 

1 6 8 10 

2 12 14 16 

In [28]: frame.sum(level= 1 color' , axis=l) 

0ut[28] : 

color Green Red 

keyl key2 

a 1 2 1 

2 8 4 

b 1 14 7 

2 20 10 

Under the hood, this utilizes pandas’s groupby machinery, which will be discussed in 
more detail later in the book 

Indexing with a DataFrame's columns 

It’s not unusual to want to use one or more columns from a DataFrame as the row 
index; alternatively, you may wish to move the row index into the DataFrame’s col¬ 
umns. Here’s an example DataFrame: 
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In [29]: frame = pd.DataFrame({ 'a' : range(7), ' b' : range(7, 0, -1), 

'c' : ['one', 'one', 'one', 'two', 'two', 

....: 'two', 'two'], 

'd': [0, 1, 2, 0, 1, 2, 3]}) 

In [30]: frame 
Out[30] : 

a b c d 
007 one 0 

1 1 6 one 1 

2 2 5 one 2 

334 two 0 
443 two 1 
552 two 2 
661 two 3 

DataFrame’s set_index function will create a new DataFrame using one or more of 
its columns as the index: 

In [31]: frame2 = frame.set_index( [ 'c' , ' d' ]) 

In [32]: frame2 
0ut[32] : 

a b 

c d 

one 007 
116 
2 2 5 

two 034 
14 3 

2 5 2 

3 6 1 

By default the columns are removed from the DataFrame, though you can leave them 
in: 


In [33]: frame.set_index( [' c' , 'd' ], drop=False) 

0ut[33] : 

a b c d 

c d 

one 007 one 0 

1 1 6 one 1 

2 2 5 one 2 

two 034 two 0 

143 two 1 
252 two 2 
361 two 3 

reset_index, on the other hand, does the opposite of set_index; the hierarchical 
index levels are moved into the columns: 
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In [34]: frame2.reset_lndex( ) 

0ut[34] : 

c d a b 
0 one 007 

1 one 1 1 6 

2 one 2 2 5 

3 two 034 

4 two 143 

5 two 252 

6 two 361 

8.2 Combining and Merging Datasets 

Data contained in pandas objects can be combined together in a number of ways: 

• pandas.merge connects rows in DataFrames based on one or more keys. This 
will be familiar to users of SQL or other relational databases, as it implements 
database join operations. 

• pandas. concat concatenates or “stacks” together objects along an axis. 

• The combine_first instance method enables splicing together overlapping data 
to fill in missing values in one object with values from another. 

I will address each of these and give a number of examples. They’ll be utilized in 
examples throughout the rest of the book. 

Database-Style DataFrame Joins 

Merge or join operations combine datasets by linking rows using one or more keys. 
These operations are central to relational databases (e.g., SQL-based). The merge 
function in pandas is the main entry point for using these algorithms on your data. 

Let’s start with a simple example: 

In [35]: dfl = pd.DataFrame({ 1 key' : ['b', 'b', 'a', 'c' , 'a', 'a 1 , 'b'], 

....: 'datal': range(7)}) 

In [36]: df2 = pd.DataFrame({ 1 key' : ['a', 'b', ' d' ], 

....: 'data2': range(3)}) 

In [37]: dfl 
0ut[37] : 

datal key 
0 0 b 

1 lb 

2 2 a 

3 3 c 

4 4 a 

5 5 a 
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6 


6 b 


In [38]: df2 
Out[38] : 

data2 key 
0 0 a 

1 lb 

2 2d 

This is an example of a many-to-one join; the data in dfl has multiple rows labeled a 
and b, whereas df 2 has only one row for each value in the key column. Calling merge 
with these objects we obtain: 

In [39]: pd.merge(dfl, df2) 

0ut[39] : 

datal key data2 


Note that I didn’t specify which column to join on. If that information is not speci¬ 
fied, merge uses the overlapping column names as the keys. It’s a good practice to 
specify explicitly, though: 

In [40]: pd.merge(dfl, df2, on='key') 

Out[40]: 

datal key data2 
0 0 b 1 

1 lb 1 

2 6 b 1 

3 2 a 0 

4 4 a 0 

5 5 a 0 

If the column names are different in each object, you can specify them separately: 

In [41]: df3 = pd.DataFrame({ 'Ikey' : ['b 1 , 'b' , 'a 1 , 'c', 'a', 'a', ' b' ], 

....: 'datal': range(7)}) 

In [42]: df4 = pd.DataFrame({ 'rkey' : [’a 1 , 'b', 'd' ], 

....: 'data2': range(3)}) 

In [43]: pd.merge(df3, df4, left_on='lkey' , right_on=' rkey' ) 

0ut[43]: 

datal Ikey data2 rkey 
0 0 b lb 

1 lb lb 

2 6 b lb 

3 2 a 0 a 
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4 4 a 0 a 

5 5 a 0 a 

You may notice that the 1 c' and ' d' values and associated data are missing from the 
result. By default merge does an 'inner' join; the keys in the result are the intersec¬ 
tion, or the common set found in both tables. Other possible options are 'left', 
'right', and 'outer'. The outer join takes the union of the keys, combining the 
effect of applying both left and right joins: 

In [44]: pd.merge(dfl, df2, how='outer') 

0ut[44] : 



datal key 

data2 

0 

0.0 

b 

1.0 

1 

1.0 

b 

1.0 

2 

6.0 

b 

1.0 

3 

2.0 

a 

0.0 

4 

4.0 

a 

0.0 

5 

5.0 

a 

0.0 

6 

3.0 

c 

NaN 

7 

NaN 

d 

2.0 


See Table 8-1 for a summary of the options for how. 
Table 8-1. Different join types with how argument 


Option Behavior 


' inner' Use only the key combinations observed in both tables 

' left' Use all key combinations found in the left table 

' right' Use all key combinations found in the right table 

' output 1 Use all key combinations observed in both tables together 


Many-to-many merges have well-defined, though not necessarily intuitive, behavior. 
Here’s an example: 

In [45]: dfl = pd .DataFrame({ 'key' : ['b', 'b', 'a', ' c' , 'a 1 , 'b' ], 

....: 'datal': range(6)}) 

In [46]: df2 = pd .DataFrame({ 1 key' : [’a 1 , 1 b’ , 'a', ' b' , 'd'], 

....: 'data2': range(5)}) 

In [47]: dfl 
0ut[47] : 

datal key 
0 0 b 

1 lb 

2 2 a 

3 3 c 

4 4 a 

5 5 b 
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In [48]: df2 
0ut[48] : 

data2 key 
0 0 a 

1 lb 

2 2 a 

3 3 b 

4 4 d 


In [49]: pd.merge(dfl, df2, on='key', how='left') 
0ut[49] : 

datal key data2 
0 0 b 1.0 

1 0 b 3.0 

2 1 b 1.0 

3 1 b 3.0 

4 2 a 0.0 

5 2 a 2.0 

6 3 c NaN 

7 4 a 0.0 

8 4 a 2.0 

9 5 b 1.0 

10 5b 3.0 


Many-to-many joins form the Cartesian product of the rows. Since there were three 
1 b 1 rows in the left DataFrame and two in the right one, there are six 1 b 1 rows in the 
result. The join method only affects the distinct key values appearing in the result: 


In [50]: pd.merge(dfl, df2, how='inner') 
Out[50] : 

datal key data2 
0 0 b 1 

1 0 b 3 

2 lb 1 

3 lb 3 

4 5 b 1 

5 5 b 3 

6 2 a 0 

7 2 a 2 

8 4 a 0 

9 4 a 2 


To merge with multiple keys, pass a list of column names: 


In [51] 

left = pd.DataFrame({ 1 keyl 1 : 

[' foo' , 

'foo', 

bar’ ], 



'key2 1 : 

[ 'one' , 

'two', 

one’ ], 



'Ival' : 

[1, 2, 

3]}) 



In [52] 

right = pd.DataFrame({ 'keyl' 

: [' foo' 

, 1 f00' , 

' bar', 

'bar']. 


'key2' 

: ['one' 

, 'one'. 

'one', 

1 two' ], 


'rval' 

: [4, 5, 

6, 7]}) 




In [53]: pd.merge(left, right, on=['keyl', 'key2'], how='outer') 
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Out [ 53 ]: 



keyl key2 

Ival 

rval 

0 

foo 

one 

1.0 

4.0 

1 

foo 

one 

1.0 

5.0 

2 

foo 

two 

2.0 

NaN 

3 

bar 

one 

3.0 

6.0 

4 

bar 

two 

NaN 

7.0 


To determine which key combinations will appear in the result depending on the 
choice of merge method, think of the multiple keys as forming an array of tuples to 
be used as a single join key (even though it’s not actually implemented that way). 



When you’re joining columns-on-columns, the indexes on the 
passed DataFrame objects are discarded. 


A last issue to consider in merge operations is the treatment of overlapping column 
names. While you can address the overlap manually (see the earlier section on 
renaming axis labels), merge has a suffixes option for specifying strings to append 
to overlapping names in the left and right DataFrame objects: 


In [54]: pd.merge(teft, right, on='keyl') 
0ut[54] : 



keyl key2_x 

Ival key2_y 

rval 

0 

foo 

one 

1 

one 

4 

1 

foo 

one 

1 

one 

5 

2 

foo 

two 

2 

one 

4 

3 

foo 

two 

2 

one 

5 

4 

bar 

one 

3 

one 

6 

5 

bar 

one 

3 

two 

7 


In [55]: pd.merge(teft, right, on='keyl', 
0ut[55] : 



keyl key2_lett 

Ival key2_ 

.right 

rval 

0 

foo 

one 

1 

one 

4 

1 

foo 

one 

1 

one 

5 

2 

foo 

two 

2 

one 

4 

3 

foo 

two 

2 

one 

5 

4 

bar 

one 

3 

one 

6 

5 

bar 

one 

3 

two 

7 


suffixes=( '_left' , 


Aright' )) 


See Table 8-2 for an argument reference on merge. Joining using the DataFrame’s row 
index is the subject of the next section. 
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Table 8-2. merge function arguments 


Argument Description 


left 

right 

how 

DataFrame to be merged on the left side. 

DataFrame to be merged on the right side. 

One of 'inner', 'outer', 'left', or ' right'; defaults to 'inner'. 

on 

Column names to join on. Must be found in both DataFrame objects. If not specified and no other join keys 
given, will use the intersection of the column names in left and right as the join keys. 

left_on 

Columns in left DataFrame to use as join keys. 

rtght_on 

left_index 

Analogous to left_on for left DataFrame. 

Use row index in lef t as its join key (or keys, if a Multiindex). 


right_lndex Analogous to left_lndex. 


sort 

Sort merged data lexicographically by join keys; True by default (disable to get better performance in 
some cases on large datasets). 

suffixes 

Tuple of string values to append to column names in case of overlap; defaults to (' _x', ' _y') (e.g., if 
'data' in both DataFrame objects, would appear as 'data_x' and 'data_y' in result). 

copy 

If False, avoid copying data into resulting data structure in some exceptional cases; by default always 
copies. 

indicator 

Adds a special column _merge that indicates the source of each row; values will be ' lef t_only', 

' right_only', or ' both' based on the origin of the joined data in each row. 


Merging on Index 

In some cases, the merge key(s) in a DataFrame will be found in its index. In this 
case, you can pass left_index=True or right_index=True (or both) to indicate that 
the index should be used as the merge key: 

In [56]: leftl = pd.DataFrame({' key' : ['a', 'b', 'a', 'a', ' b' , 'c' ], 



'value' : range(6)}) 

In [57]: 

rightl = pd.DataFrame({' group_val' : [3.5, 7]}, index=['a', 'b' ]) 

In [58]: 

leftl 


0ut[58] : 
key value 


0 a 

1 b 

2 a 

3 a 

4 b 

5 c 

0 

1 

2 

3 

4 

5 

In [59]: 
Out[59] : 

rightl 


group_val 
a 3.5 
b 7.0 
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In [60]: pd.merge(leftl, rlghtl, left_on= 'key 1 , rtght_index=True) 

Out[60] : 

key value group_val 
0a 0 3.5 

2a 2 3.5 

3a 3 3.5 

lb 1 7.0 

4 b 4 7.0 

Since the default merge method is to intersect the join keys, you can instead form the 
union of them with an outer join: 

In [61]: pd.merge(leftl, rightl, left_on= 'key 1 , rlght_index=True, how='outer') 

0ut[61] : 

key value group_val 
0a 0 3.5 

2a 2 3.5 

3a 3 3.5 

lb 1 7.0 

4 b 4 7.0 

5c 5 NaN 

With hierarchically indexed data, things are more complicated, as joining on index is 
implicitly a multiple-key merge: 

In [62]: lefth = pd.DataFrane({' keyl' : ['Ohio', 'Ohio', 'Ohio', 

....: 'Nevada', 'Nevada'], 

_: 'key2' : [2000, 2001, 2002, 2001, 2002], 

....: 'data': np.arange(5. )}) 

In [63]: righth = pd.DataFrame(np. arange(12) . reshape((6, 2)), 

....: lndex=[[ 'Nevada' , 'Nevada', 'Ohio', 'Ohio', 

. . . . : 'Ohio' , 'Ohio' ], 

....: [2001, 2000, 2000, 2000, 2001, 2002]], 

....: columns=[ 1 eventl' , 'event2']) 


In [64] 

: lefth 


0ut[64] 



data 

keyl 

key2 

0 0.0 

Ohio 

2000 

1 1.0 

Ohio 

2001 

2 2.0 

Ohio 

2002 

3 3.0 

Nevada 

2001 

4 4.0 

Nevada 

2002 

In [65] 

: righth 


0ut[65] 




eventl i 

Nevada 

2001 

0 

2000 

2 

Ohio 

2000 

4 

2000 

6 
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2001 

2002 


8 

10 


9 

11 


In this case, you have to indicate multiple columns to merge on as a list (note the 
handling of duplicate index values with how= 1 outer'): 


In [66]: pd.merge(lefth, righth, left_on=[ 'keyl' , 
0ut[66] : 



data 

keyl 

key2 

eventl 

event2 

0 

0.0 

Ohio 

2000 

4 

5 

0 

0.0 

Ohio 

2000 

6 

7 

1 

1.0 

Ohio 

2001 

8 

9 

2 

2.0 

Ohio 

2002 

10 

11 

3 

3.0 

Nevada 

2001 

0 

1 

In [67]: 

0ut[67] : 

pd.merge(lefth 

right 

, righth, left_on=[' keyl' 
_index=True, how=' outer') 


data 

keyl 

key2 

eventl 

event2 

0 

0.0 

Ohio 

2000 

4.0 

5.0 

0 

0.0 

Ohio 

2000 

6.0 

7.0 

1 

1.0 

Ohio 

2001 

8.0 

9.0 

2 

2.0 

Ohio 

2002 

10.0 

11.0 

3 

3.0 

Nevada 

2001 

0.0 

1.0 

4 

4.0 

Nevada 

2002 

NaN 

NaN 

4 

NaN 

Nevada 

2000 

2.0 

3.0 


'key2'], right_index=True) 


'key2' ], 


Using the indexes of both sides of the merge is also possible: 

In [68]: left2 = pd.DataFrane( [[1 . , 2.], [3., 4.], [5., 6.]], 
....: index=['a', 'c', 'e' ], 

columns=[ 'Ohio' , 'Nevada']) 


In [69]: right2 = pd.DataFrame( [[7., 8.], [9., 10.], [11., 12.], [13, 14]], 
....: tndex=[’b', ’c’, ' d’ , 'e'], 

....: columns=[' Missouri' , 'Alabama']) 


In [70]: left2 
Out[70] : 

Ohio Nevada 
a 1.0 2.0 
c 3.0 4.0 
e 5.0 6.0 


In [71]: right2 
0ut[71] : 

Missouri Alabama 
b 7.0 8.0 
c 9.0 10.0 
d 11.0 12.0 
e 13.0 14.0 


In [72]: pd.merge(left2, right2, how='outer', left_index=True, right_index=True) 
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0ut[72] : 
Ohio 

Nevada 

Missouri 

Alabama 

a 1.0 

2.0 

NaN 

NaN 

b NaN 

NaN 

7.0 

8.0 

c 3.0 

4.0 

9.0 

10.0 

d NaN 

NaN 

11.0 

12.0 

e 5.0 

6.0 

13.0 

14.0 


DataFrame has a convenient join instance for merging by index. It can also be used 
to combine together many DataFrame objects having the same or similar indexes but 
non-overlapping columns. In the prior example, we could have written: 

In [73]: left2.jotn(right2, how='outer') 


0ut[73] : 
Ohio 

Nevada 

Missouri 

Alabama 

a 1.0 

2.0 

NaN 

NaN 

b NaN 

NaN 

7.0 

8.0 

c 3.0 

4.0 

9.0 

10.0 

d NaN 

NaN 

11.0 

12.0 

e 5.0 

6.0 

13.0 

14.0 


In part for legacy reasons (i.e., much earlier versions of pandas), DataFrame’s join 
method performs a left join on the join keys, exactly preserving the left frames row 
index. It also supports joining the index of the passed DataFrame on one of the col¬ 
umns of the calling DataFrame: 

In [74]: leftl.jotn(rightl, on='key') 

0ut[74] : 

key value group_val 
0 a 0 3.5 

lb 1 7.0 

2a 2 3.5 

3a 3 3.5 

4 b 4 7.0 

5c 5 NaN 

Lastly, for simple index-on-index merges, you can pass a list of DataFrames to join as 
an alternative to using the more general concat function described in the next 
section: 

In [75]: another = pd.DataFrame([[7. , 8], [9., 10.], [11., 12.], [16., 17.]], 
....: index=['a', 'c', 'e', ' f' ], 

....: columns=[ 'New York' , 'Oregon']) 


In [76]: another 
0ut[76] : 

New York Oregon 
a 7.0 8.0 
c 9.0 10.0 
e 11.0 12.0 
f 16.0 17.0 
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In 

[77]: 

left2.join( [right2. 

another] ) 



0ut[77] : 









Ohio 

Nevada 

Missouri 

Alabama 

New 

York 

Oregon 

a 

1.0 

2.0 

NaN 


NaN 


7.0 

8.0 

c 

3.0 

4.0 

9.0 


10.0 


9.0 

10.0 

e 

5.0 

6.0 

13.0 


14.0 


11.0 

12.0 

In 

00 

t'- 

left2.joint [right2. 

another], 1 

how=' outer 1 ) 

0ut[78] : 









Ohio 

Nevada 

Missouri 

Alabama 

New 

York 

Oregon 

a 

1.0 

2.0 

NaN 


NaN 


7.0 

8.0 

b 

NaN 

NaN 

7.0 


8.0 


NaN 

NaN 

c 

3.0 

4.0 

9.0 


10.0 


9.0 

10.0 

d 

NaN 

NaN 

11.0 


12.0 


NaN 

NaN 

e 

5.0 

6.0 

13.0 


14.0 


11.0 

12.0 

f 

NaN 

NaN 

NaN 


NaN 


16.0 

17.0 


Concatenating Along an Axis 

Another kind of data combination operation is referred to interchangeably as concat¬ 
enation, binding, or stacking. NumPy’s concatenate function can do this with 
NumPy arrays: 

In [79]: arr = np.arange(12).reshape((3, 4)) 

In [80]: arr 
Out[80] : 

array([[ 0, 1, 2, 3], 

[ 4, 5, 6, 7], 

[ 8, 9, 10, 11]]) 

In [81]: np.concatenate([arr, arr], axts=l) 


0ut[81] : 


arrayt [[ 0, 

1, 

2, 

3, 

0, 

1, 

2, 

3], 

[ 4, 

5, 

6, 

7, 

4, 

5, 

6, 

7], 

[ 8, 

9 , 

10, 

11, 

8, 

9 , 

10, 

11]]) 


In the context of pandas objects such as Series and DataFrame, having labeled axes 
enable you to further generalize array concatenation. In particular, you have a num¬ 
ber of additional things to think about: 

• If the objects are indexed differently on the other axes, should we combine the 
distinct elements in these axes or use only the shared values (the intersection)? 

• Do the concatenated chunks of data need to be identifiable in the resulting 
object? 

• Does the “concatenation axis” contain data that needs to be preserved? In many 
cases, the default integer labels in a DataFrame are best discarded during 
concatenation. 
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The concat function in pandas provides a consistent way to address each of these 
concerns. I’ll give a number of examples to illustrate how it works. Suppose we have 
three Series with no index overlap: 

In [82]: si = pd.Series( [0, 1], index=['a', 'b' ]) 

In [83]: s2 = pd.Series( [2, 3, 4], index=['c', ' d' , 'e' ]) 

In [84]: s3 = pd.Series( [5, 6], index=['f', 'g' ]) 

Calling concat with these objects in a list glues together the values and indexes: 

In [85]: pd.concat([sl, s2, s3]) 

0ut[85] : 
a 0 
b 1 
c 2 
d 3 
e 4 
f 5 
9 6 

dtype: tnt64 

By default concat works along axis=0, producing another Series. If you pass axis=l, 
the result will instead be a DataFrame (axis=l is the columns): 

In [86]: pd.concat([sl, s2, s3], axis=l) 


0ut[86] 

0 

a 0.0 

1 

NaN 

2 

NaN 

b 

1.0 

NaN 

NaN 

c 

NaN 

2.0 

NaN 

d 

NaN 

3.0 

NaN 

e 

NaN 

4.0 

NaN 

f 

NaN 

NaN 

5.0 

9 

NaN 

NaN 

6.0 


In this case there is no overlap on the other axis, which as you can see is the sorted 
union (the 'outer' join) of the indexes. You can instead intersect them by passing 
join='inner': 

In [87]: s4 = pd.concat( [si, s3]) 

In [88]: s4 
0ut[88] : 
a 0 
b 1 
f 5 
9 6 

dtype: int64 

In [89]: pd.concat([sl, s4], axts=l) 

0ut[89] : 
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0 1 
a 0.0 0 

b 1.0 1 

f NaN 5 
g NaN 6 

In [90]: pd.concat([sl, s4], axls=l, join= 1 inner' ) 

Out[90] : 

0 1 
a 0 0 

b 1 1 

In this last example, the 1 f' and ' g' labels disappeared because of the join=' inner' 
option. 

You can even specify the axes to be used on the other axes with join_axes: 

In [91]: pd.concat([sl, s4], axls=l, join_axes: [ [' a' , 'c', 'b', 'e']]) 

0ut[91] : 

0 1 
a 0.0 0.0 

c NaN NaN 
b 1.0 1.0 

e NaN NaN 

A potential issue is that the concatenated pieces are not identifiable in the result. Sup¬ 
pose instead you wanted to create a hierarchical index on the concatenation axis. To 
do this, use the keys argument: 

In [92]: result = pd,concat( [si, si, s3], keys: ['one 1 , 'two', 'three']) 

In [93]: result 
0ut[93] : 
one a 0 

b 1 
two a 0 

b 1 
three f 5 

9 6 

dtype: tnt64 


In [94]: result.unstack( ) 
0ut[94] : 



a 

b 

f 

9 

one 

0.0 

1.0 

NaN 

NaN 

two 

0.0 

1.0 

NaN 

NaN 

three 

NaN 

NaN 

5.0 

6.0 


In the case of combining Series along axis=l, the keys become the DataFrame col¬ 
umn headers: 

In [95]: pd.concat([sl, s2, s3], axls=l, keys=['one', 'two', 'three']) 

0ut[95] : 
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one 

two 

three 

a 

0.0 

NaN 

NaN 

b 

1.0 

NaN 

NaN 

c 

NaN 

2.0 

NaN 

d 

NaN 

3.0 

NaN 

e 

NaN 

4.0 

NaN 

f 

NaN 

NaN 

5.0 

g 

NaN 

NaN 

6.0 


The same logic extends to DataFrame objects: 

In [96]: dfl = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a', ' b' , 'c' ], 

columns=[ 'one' , 'two']) 

In [97]: df2 = pd.DataFrame(5 + np.arange(4) . reshape(2, 2), tndex=['a', 'c'], 

columns=[ 'three' , 'four']) 

In [98]: dfl 
0ut[98] : 

one two 
a 0 1 

b 2 3 

c 4 5 

In [99]: df2 

0ut[99] : 

three four 
a 5 6 

c 7 8 

In [100]: pd.concat([dfl, df2], axls=l, keys=[' levell' , 'level2']) 

Out[100] : 

levell level2 

one two three four 
a 0 1 5.0 6.0 

b 2 3 NaN NaN 

c 4 5 7.0 8.0 

If you pass a diet of objects instead of a list, the diet’s keys will be used for the keys 
option: 

In [101]: pd.concat({ 'levell' : dfl, 'level2': df2}, axis=l) 

Out[101] : 

levell level2 

one two three four 
a 0 1 5.0 6.0 

b 2 3 NaN NaN 

c 4 5 7.0 8.0 

There are additional arguments governing how the hierarchical index is created (see 
Table 8-3). For example, we can name the created axis levels with the names 
argument: 
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In [102]: pd.concat([dfl, df2], axls=l, keys=[' levell 1 , 'level2'], 

.: names=['upper', ’lower 1 ]) 

Out[102] : 

upper levell level2 
lower one two three four 

a 0 1 5.0 6.0 

b 2 3 NaN NaN 

c 4 5 7.0 8.0 

A last consideration concerns DataFrames in which the row index does not contain 
any relevant data: 

In [103]: dfl = pd.DataFramefnp.random.randn(3, 4), columns ['a', 'b', 'c', 'd' ]) 

In [104]: df2 = pd.DataFrame(np.random.randn(2, 3), columns=[ 1 b 1 , ' d' , 'a']) 

In [105]: dfl 
Out[105] : 

a b 

0 1.246435 1.007189 

1 0.228913 1.352917 

2 -0.371843 1.669025 

In [106]: df2 
Out[106] : 

b d a 

0 0.476985 3.248944 -1.021228 

1 -0.577087 0.124121 0.302614 

In this case, you can pass ignore_Lndex=True: 

In [107]: pd.concat([dfl, df2], ignore_index=True) 

Out[107] : 

abed 
0 1.246435 1.007189 -1.296221 0.274992 

1 0.228913 1.352917 0.886429 -2.001637 

2 -0.371843 1.669025 -0.438570 -0.539741 

3 -1.021228 0.476985 NaN 3.248944 

4 0.302614 -0.577087 NaN 0.124121 


c d 
-1.296221 0.274992 
0.886429 -2.001637 
-0.438570 -0.539741 


Table 8-3. concatfunction arguments 


Argument Description 


ob j s List or diet of pandas objects to be concatenated; this is the only required argument 

axis Axis to concatenate along; defaults to 0 (along rows) 


join 

join_axes 

keys 


Either 'inner' or 'outer' ('outer' by default); whether to intersection (inner) or union 
(outer) together indexes along the other axes 

Specific indexes to use for the other n- 1 axes instead of performing union/intersection logic 
Values to associate with objects being concatenated, forming a hierarchical index along the 
concatenation axis; can either be a list or array of arbitrary values, an array of tuples, or a list of 
arrays (if multiple-level arrays passed in levels) 
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1 Argument 

Description || 

levels 

Specific indexes to use as hierarchical index level or levels if keys passed 

names 

Names for created hierarchical levels if keys and/or levels passed 

verify_lntegrity 

Check new axis in concatenated object for duplicates and raise exception if so; by default (False) 
allows duplicates 

lgnore_index 

Do not preserve indexes along concatenation axis, instead producing a new 
range(total_length) index 


Combining Data with Overlap 

There is another data combination situation that can’t be expressed as either a merge 
or concatenation operation. You may have two datasets whose indexes overlap in full 
or part. As a motivating example, consider NumPy’s where function, which performs 
the array-oriented equivalent of an if-else expression: 

In [108]: a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan], 

.: lndex=['f', 'e', 'd' , 'c', 'b' , 'a']) 

In [109]: b = pd.Series(np.arange(ten(a), dtype=np.float64) , 

.: lndex=['f', 'e', 'd' , 'c', 'b' , 'a']) 

In [110]: b[-l] = np.nan 

In [111]: a 
Out[lll] : 
f NaN 
e 2.5 
d NaN 
c 3.5 
b 4.5 
a NaN 
dtype: float64 

In [112]: b 
0ut[112] : 
f 0.0 
e 1.0 
d 2.0 
c 3.0 
b 4.0 
a NaN 
dtype: float64 

In [113]: np.where(pd.isnult(a), b, a) 

0ut[113] : array( [ 0. , 2.5, 2. , 3.5, 4.5, nan]) 

Series has a combi.ne_fi.rst method, which performs the equivalent of this operation 
along with pandas’s usual data alignment logic: 
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In [114]: b[ :-2]. combine_flrst(a[2 : ]) 

0ut[114] : 
a NaN 
b 4.5 
c 3.0 
d 2.0 
e 1.0 
f 0.0 
dtype: float64 

With DataFrames, combine_first does the same thing column by column, so you 
can think of it as “patching” missing data in the calling object with data from the 
object you pass: 

In [115]: dfl = pd.DataFrame({ 'a 1 : [1., np.nan, 5., np.nan], 

.: 'b 1 : [np.nan, 2., np.nan, 6.], 

.: 'c' : range(2, 18, 4)}) 

In [116]: df2 = pd.DataFrame({ 'a 1 : [5., 4., np.nan, 3., 7.], 

.: 'b' : [np.nan, 3., 4., 6., 8.]}) 


In [117]: dfl 
0ut[117] : 

a be 

0 1.0 NaN 2 

1 NaN 2.0 6 

2 5.0 NaN 10 

3 NaN 6.0 14 

In [118]: df2 
0ut[118] : 

a b 

0 5.0 NaN 

1 4.0 3.0 

2 NaN 4.0 

3 3.0 6.0 

4 7.0 8.0 

In [119]: dfl.combine_first(df2) 
0ut[119] : 

a b c 

0 1.0 NaN 2.0 

1 4.0 2.0 6.0 

2 5.0 4.0 10.0 

3 3.0 6.0 14.0 

4 7.0 8.0 NaN 


8.3 Reshaping and Pivoting 

There are a number of basic operations for rearranging tabular data. These are alter- 
natingly referred to as reshape or pivot operations. 
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Reshaping with Hierarchical Indexing 

Hierarchical indexing provides a consistent way to rearrange data in a DataFrame. 

There are two primary actions: 

stack 

This “rotates” or pivots from the columns in the data to the rows 

unstack 

This pivots from the rows into the columns 

I’ll illustrate these operations through a series of examples. Consider a small Data¬ 
Frame with string arrays as row and column indexes: 

In [120]: data = pd.DataFrane(np.arange(6) . reshape( (2, 3)), 

.: index=pd.Index(['Ohio', 'Colorado'], name='state'), 

.: columns=pd.Index(['one', 'two', 'three'], 

.: name='number')) 


In [121]: data 
0ut[121] : 

number one two three 
state 

Ohio 01 2 

Colorado 34 5 

Using the stack method on this data pivots the columns into the rows, producing a 
Series: 


In [122]: result = data.stackQ 

In [123]: result 
0ut[123]: 
state number 


Ohio one 0 

two 1 

three 2 

Colorado one 3 

two 4 

three 5 

dtype: tnt64 


From a hierarchically indexed Series, you can rearrange the data back into a Data¬ 
Frame with unstack: 


In [124]: result.unstack( ) 
0ut[124] : 

number one two three 
state 

Ohio 01 2 

Colorado 34 5 
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By default the innermost level is unstacked (same with stack). You can unstack a dif¬ 
ferent level by passing a level number or name: 

In [125]: result.unstack(0) 

Out[ 125] : 

state Ohio Colorado 
number 

one 0 3 

two 1 4 

three 2 5 


In [126]: result.unstack( 1 state' ) 

0ut[126] : 

state Ohio Colorado 

number 

one 0 3 

two 1 4 

three 2 5 

Unstacking might introduce missing data if all of the values in the level aren’t found 
in each of the subgroups: 

In [127]: si = pd.Serles( [0, 1, 2, 3], index=['a', 'b' , 'c', 'd' ]) 

In [128]: s2 = pd.Serles( [4, 5, 6 ], lndex=['c', ' d 1 , ' e' ] ) 

In [129]: data2 = pd,concat( [si, s2], keys=['one', 'two']) 


In [130]: data2 
Out[130] : 
one a 0 

b 1 

c 2 

d 3 

two c 4 

d 5 

e 6 

dtype: lnt64 


In [131]: data2.unstackQ 
0ut[131]: 

a b c d e 

one 0.0 1.0 2.0 3.0 NaN 

two NaN NaN 4.0 5.0 6.0 

Stacking filters out missing data by default, so the operation is more easily invertible: 

In [132]: data2.unstackQ 
0ut[132]: 

a b c d e 

one 0.0 1.0 2.0 3.0 NaN 

two NaN NaN 4.0 5.0 6.0 
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In [133]: data2.unstack() .stackQ 
Out[133]: 
one a 0.0 

b 1.0 

c 2.0 

d 3.0 

two c 4.0 

d 5.0 

e 6.0 

dtype: float64 


In [134]: data2.unstack( ). stack(dropna=False) 

0ut[134] : 
one a 0.0 

b 1.0 
c 2.0 
d 3.0 
e NaN 
two a NaN 

b NaN 
c 4.0 
d 5.0 
e 6.0 
dtype: float64 

When you unstack in a DataFrame, the level unstacked becomes the lowest level in 
the result: 


In [135]: df = pd,DataFrame({ 'left' : result, 'right': result + 5}, 

.: columns=pd.Index(['left', 'right'], name='side')) 


In [136]: df 
0ut[136]: 

side left 

state number 

Ohio one 0 

two 1 

three 2 

Colorado one 3 

two 4 

three 5 


right 

5 

6 

7 

8 
9 

10 


In [137]: df.unstack( 1 state' ) 

0ut[137]: 

side left right 

state Ohio Colorado Ohio Colorado 

number 

one 0 3 5 8 

two 1 4 6 9 

three 2 5 7 10 

When calling stack, we can indicate the name of the axis to stack: 
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In [138]: df.unstack( ': 

state' ) ,stack( 'side' ) 

0ut[138]: 
state 

Colorado 

Ohio 

number 

one 

side 

left 

3 

0 


right 

8 

5 

two 

left 

4 

1 


right 

9 

6 

three 

left 

5 

2 


right 

10 

7 


Pivoting "Long" to "Wide" Format 

A common way to store multiple time series in databases and CSV is in so-called long 
or stacked format. Let’s load some example data and do a small amount of time series 
wrangling and other data cleaning: 


In [139]: data = pd.read_csv( 'examples/macrodata.csv' ) 


In [140]: data.headQ 
Out [140] : 



year 

quarter 

realgdp realcons 

realinv i 

realgovt 

realdpi 

cpi 

0 

1959.0 

1.0 

2710. 

349 1707.4 

286.898 

470.045 

1886.9 

28.98 

1 

1959.0 

2.0 

2778. 

801 1733.7 

310.859 

481.301 

1919.7 

29.15 

2 

1959.0 

3.0 

2775. 

488 1751.8 

289.226 

491.260 

1916.4 

29.35 

3 

1959.0 

4.0 

2785. 

204 1753.7 

299.356 

484.052 

1931.3 

29.37 

4 

1960.0 

1.0 

2847. 

699 1770.5 

331.722 

462.199 

1955.5 

29.54 


ml 

tbilrate 

unenp 

pop infl 

realint 




0 

139.7 

2.82 

5.8 

177.146 0.00 

0.00 




1 

141.7 

3.08 

5.1 

177.830 2.34 

0.74 




2 

140.5 

3.82 

5.3 

178.657 2.74 

1.09 




3 

140.0 

4.33 

5.6 

179.386 0.27 

4.06 




4 

139.6 

3.50 

5.2 

180.007 2.31 

1.19 




In 

[141]: 

periods = 

pd.PeriodIndex(year= 

data.year 

, quarter: 

=data.quarter. 






'date') 

'infl' , 




In 

[142]: 

columns = 

pd.Index( [ 'realgdp' , 

'unemp' ], 

name= 'item' ) 


In [143]: data = data.reindex(columns=columns) 


In [144]: data.index = periods.to_timestamp( 'D 1 , 'end') 


In [145]: tdata = data.stack( ). reset_index( ). rename(columns={0: 'value'}) 

We will look at Periodlndex a bit more closely in Chapter 11. In short, it combines 
the year and quarter columns to create a kind of time interval type. 


Now, Idata looks like: 


In [146]: ldata[:10] 
0ut[146]: 
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date 

iten 

value 

0 

1959-03-31 

realgdp 

2710.349 

1 

1959-03-31 

infl 

0.000 

2 

1959-03-31 

unenp 

5.800 

3 

1959-06-30 

realgdp 

2778.801 

4 

1959-06-30 

infl 

2.340 

5 

1959-06-30 

unenp 

5.100 

6 

1959-09-30 

realgdp 

2775.488 

7 

1959-09-30 

infl 

2.740 

8 

1959-09-30 

unenp 

5.300 

9 

1959-12-31 

realgdp 

2785.204 


This is the so-called long format for multiple time series, or other observational data 
with two or more keys (here, our keys are date and item). Each row in the table repre¬ 
sents a single observation. 

Data is frequently stored this way in relational databases like MySQL, as a fixed 
schema (column names and data types) allows the number of distinct values in the 
iten column to change as data is added to the table. In the previous example, date 
and iten would usually be the primary keys (in relational database parlance), offering 
both relational integrity and easier joins. In some cases, the data may be more diffi¬ 
cult to work with in this format; you might prefer to have a DataFrame containing 
one column per distinct iten value indexed by timestamps in the date column. Data- 
Frame’s pivot method performs exactly this transformation: 

In [147]: pivoted = ldata.pivot( 'date' , 'iten', 'value') 


In [148]: pivoted 
0ut[148] : 


iten 

date 

infl 

realgdp 

unenp 

1959-03-31 

0.00 

2710.349 

5.8 

1959-06-30 

2.34 

2778.801 

5.1 

1959-09-30 

2.74 

2775.488 

5.3 

1959-12-31 

0.27 

2785.204 

5.6 

1960-03-31 

2.31 

2847.699 

5.2 

1960-06-30 

0.14 

2834.390 

5.2 

1960-09-30 

2.70 

2839.022 

5.6 

1960-12-31 

1.21 

2802.616 

6.3 

1961-03-31 

-0.40 

2819.264 

6.8 

1961-06-30 

1.47 

2872.005 

7.0 

2007-06-30 

2.75 

13203.977 

4.5 

2007-09-30 

3.45 

13321.109 

4.7 

2007-12-31 

6.38 

13391.249 

4.8 

2008-03-31 

2.82 

13366.865 

4.9 

2008-06-30 

8.53 

13415.266 

5.4 

2008-09-30 

-3.16 

13324.600 

6.0 

2008-12-31 

-8.79 

13141.920 

6.9 

2009-03-31 

0.94 

12925.410 

8.1 

2009-06-30 

3.37 

12901.504 

9.2 
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2009-09-30 3.56 12990.341 9.6 

[203 rows x 3 columns] 

The first two values passed are the columns to be used respectively as the row and 
column index, then finally an optional value column to fill the DataFrame. Suppose 
you had two value columns that you wanted to reshape simultaneously: 

In [149]: ldata[ 1 value2' ] = np.random.randn(len(ldata)) 


In [150]: ldata[ : 10] 
Out[150] : 



date 

item 

value value2 

0 

1959-03-31 

realgdp 

2710.349 0.523772 

1 

1959-03-31 

infl 

0.000 0.000940 

2 

1959-03-31 

unemp 

5.800 1.343810 

3 

1959-06-30 

realgdp 

2778.801 -0.713544 

4 

1959-06-30 

infl 

2.340 -0.831154 

5 

1959-06-30 

unemp 

5.100 -2.370232 

6 

1959-09-30 

realgdp 

2775.488 -1.860761 

7 

1959-09-30 

infl 

2.740 -0.860757 

8 

1959-09-30 

unemp 

5.300 0.560145 

9 

1959-12-31 

realgdp 

2785.204 -1.265934 


By omitting the last argument, you obtain a DataFrame with hierarchical columns: 

In [151]: pivoted = ldata.pivot( 'date' , 'item') 


In [152]: pivoted[:5] 
0ut[152]: 



value 


value2 



item 

infl 

realgdp unemp infl 

realgdp 

unemp 

date 

1959-03-31 

0.00 

2710.349 

5.8 0.000940 

0.523772 

1.343810 

1959-06-30 

2.34 

2778.801 

5.1 -0.831154 

-0.713544 

-2.370232 

1959-09-30 

2.74 

2775.488 

5.3 -0.860757 

-1.860761 

0.560145 

1959-12-31 

0.27 

2785.204 

5.6 0.119827 

-1.265934 

-1.063512 

1960-03-31 

2.31 

2847.699 

5.2 -2.359419 

0.332883 

-0.199543 

In [153]: pivoted[ 

1 value' ][ 

:5] 



0ut[153]: 
item 

date 

infl 

realgdp 

unemp 



1959-03-31 

0.00 

2710.349 

5.8 



1959-06-30 

2.34 

2778.801 

5.1 



1959-09-30 

2.74 

2775.488 

5.3 



1959-12-31 

0.27 

2785.204 

5.6 



1960-03-31 

2.31 

2847.699 

5.2 




Note that pivot is equivalent to creating a hierarchical index using set_index fol¬ 
lowed by a call to unstack: 
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In [154]: unstacked = Idata.set_index( [ 1 date' , ' Item ']). unstack(' Item 1 ) 


In [155]: unstackedf :7] 
Out[155]: 


tten 

vatue 

inft 

reatgdp unenp 

vatue2 

tnft 

reatgdp 

unenp 

date 

1959-03-31 

0.00 

2710.349 

5.8 

0.000940 

0.523772 

1.343810 

1959-06-30 

2.34 

2778.801 

5.1 

-0.831154 

-0.713544 

-2.370232 

1959-09-30 

2.74 

2775.488 

5.3 

-0.860757 

-1.860761 

0.560145 

1959-12-31 

0.27 

2785.204 

5.6 

0.119827 

-1.265934 

-1.063512 

1960-03-31 

2.31 

2847.699 

5.2 

-2.359419 

0.332883 

-0.199543 

1960-06-30 

0.14 

2834.390 

5.2 

-0.970736 

-1.541996 

-1.307030 

1960-09-30 

2.70 

2839.022 

5.6 

0.377984 

0.286350 

-0.753887 


Pivoting "Wide" to "Long" Format 

An inverse operation to pivot for DataFrames is pandas.melt. Rather than trans¬ 
forming one column into many in a new DataFrame, it merges multiple columns into 
one, producing a DataFrame that is longer than the input. Let’s look at an example: 

In [157]: df = pd .DataFrane({ 1 key' : ['foo', 'bar', 'baz'], 

.: 'A': [1, 2, 3], 

.: 'B' : [4, 5, 6], 

.: 'C' : [7, 8, 9]}) 


In [158]: df 
0ut[158] : 

A B C key 

0 1 4 7 foo 

1 2 5 8 bar 

2 3 6 9 baz 

The ' key 1 column may be a group indicator, and the other columns are data values. 
When using pandas. melt, we must indicate which columns (if any) are group indica¬ 
tors. Let’s use ' key' as the only group indicator here: 

In [159]: netted = pd.melt(df, ['key']) 

In [160]: netted 

0ut[160] : 

key vartabte vatue 
0 foo A 1 

1 bar A 2 

2 baz A 3 

3 foo B 4 

4 bar B 5 

5 baz B 6 

6 foo C 7 

7 bar C 8 

8 baz C 9 
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Using pivot, we can reshape back to the original layout: 

In [161]: reshaped = melted.plvot(' key' , 'variable', 'value') 

In [162]: reshaped 
0ut[162] : 
variable ABC 
key 

bar 258 

baz 3 6 9 

foo 14 7 

Since the result of pivot creates an index from the column used as the row labels, we 
may want to use reset_index to move the data back into a column: 

In [163]: reshaped.reset_index( ) 

0ut[163] : 

variable key ABC 
0 bar 2 5 8 

1 baz 3 6 9 

2 foo 1 4 7 

You can also specify a subset of columns to use as value columns: 

In [164]: pd.melt(df, id_vars=[' key' ], value_vars=[ 'A' , 'B' ]) 

0ut[164] : 

key variable value 
0 foo A 1 

1 bar A 2 

2 baz A 3 

3 foo B 4 

4 bar B 5 

5 baz B 6 

pandas. melt can be used without any group identifiers, too: 

In [165]: pd.melt(df, value_vars=[' A' , ' B' , ' C'] ) 

0ut[165] : 

variable value 
0 A 1 

1 A 2 

2 A3 

3 B 4 

4 B 5 

5 B 6 

6 C 7 

7 C 8 

8 C 9 

In [166]: pd.melt(df, value_vars=[' key' , 'A', ' B' ] ) 

0ut[166] : 

variable value 
0 key foo 

1 key bar 
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2 key baz 

3 A 1 

4 A 2 

5 A3 

6 B 4 

7 B 5 

8 B 6 

8.4 Conclusion 

Now that you have some pandas basics for data import, cleaning, and reorganization 
under your belt, we are ready to move on to data visualization with matplotlib. We 
will return to pandas later in the book when we discuss more advanced analytics. 
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CHAPTER 9 


Plotting and Visualization 


Making informative visualizations (sometimes called plots) is one of the most impor¬ 
tant tasks in data analysis. It may be a part of the exploratory process—for example, 
to help identify outliers or needed data transformations, or as a way of generating 
ideas for models. For others, building an interactive visualization for the web may be 
the end goal. Python has many add-on libraries for making static or dynamic visuali¬ 
zations, but I’ll be mainly focused on matplotlib and libraries that build on top of it. 

matplotlib is a desktop plotting package designed for creating (mostly two- 
dimensional) publication-quality plots. The project was started by John Hunter in 
2002 to enable a MATLAB-like plotting interface in Python. The matplotlib and IPy- 
thon communities have collaborated to simplify interactive plotting from the IPython 
shell (and now, Jupyter notebook), matplotlib supports various GUI backends on all 
operating systems and additionally can export visualizations to all of the common 
vector and raster graphics formats (PDF, SVG, JPG, PNG, BMP, GIF, etc.). With the 
exception of a few diagrams, nearly all of the graphics in this book were produced 
using matplotlib. 

Over time, matplotlib has spawned a number of add-on toolkits for data visualization 
that use matplotlib for their underlying plotting. One of these is seaborn, which we 
explore later in this chapter. 

The simplest way to follow the code examples in the chapter is to use interactive plot¬ 
ting in the Jupyter notebook. To set this up, execute the following statement in a 
Jupyter notebook: 

%matplotlib notebook 

9.1 A Brief matplotlib API Primer 

With matplotlib, we use the following import convention: 
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In [11]: import as 

After running %piatplotlib notebook in Jupyter (or simply %matplotlib in IPy- 
thon), we can try creating a simple plot. If everything is set up right, a line plot like 
Figure 9-1 should appear: 

In [12]: import as 

In [13]: data = np.arange(10) 

In [14]: data 

0ut[14] : array( [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) 

In [15]: ptt.plot(data) 



While libraries like seaborn and pandas’s built-in plotting functions will deal with 
many of the mundane details of making plots, should you wish to customize them 
beyond the function options provided, you will need to learn a bit about the matplot- 
lib API. 



There is not enough room in the book to give a comprehensive 
treatment to the breadth and depth of functionality in matplotlib. It 
should be enough to teach you the ropes to get up and running. 
The matplotlib gallery and documentation are the best resource for 
learning advanced features. 
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Figures and Subplots 

Plots in matplotlib reside within a Figure object. You can create a new figure with 
pit. figure: 

In [16]: fig = plt.figureQ 

In IPython, an empty plot window will appear, but in Jupyter nothing will be shown 
until we use a few more commands, pit.figure has a number of options; notably, 
figsize will guarantee the figure has a certain size and aspect ratio if saved to disk. 

You can’t make a plot with a blank figure. You have to create one or more subplots 
using add_subplot: 

In [17]: axl = fig.add_subplot(2, 2, 1) 

This means that the figure should be 2 x 2 (so up to four plots in total), and we’re 
selecting the first of four subplots (numbered from 1). If you create the next two sub¬ 
plots, you’ll end up with a visualization that looks like Figure 9-2: 

In [18]: ax2 = fig.add_subplot(2, 2, 2) 

In [19]: ax3 = fig.add_subplot(2, 2, 3) 

1.0 -1 - 1.0 -1 - 

0.8 ■ 0.8 ■ 

0.6 - 0.6 - 

0.4 - 0.4 ■ 

0.2 ■ 0.2 ■ 

0.0 -I-1-1-1-1- 0.0 -I-1-1-1-1- 

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 

L 0 -1 - 

0.8 ■ 

0.6 - 

0.4 - 

0.2 - 

0.0 1 - 1 - 1 -.-.- 

0.0 0.2 0.4 0.6 0.8 1.0 

Figure 9-2. An empty matplotlib figure with three subplots 
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One nuance of using Jupyter notebooks is that plots are reset after 
each cell is evaluated, so for more complex plots you must put all of 
the plotting commands in a single notebook cell. 


Here we run all of these commands in the same cell: 


fig = plt.figureQ 
axl = fig.add_subplot(2 , 2, 1) 
ax2 = fig.add_subplot(2 , 2, 2) 
ax3 = fig.add_subplot(2 , 2, 3) 

When you issue a plotting command like pit.plot([1.5, 3.5, -2, 1.6]), mat- 
plotlib draws on the last figure and subplot used (creating one if necessary), thus hid¬ 
ing the figure and subplot creation. So if we add the following command, you’ll get 
something like Figure 9-3: 


In [26]: pit.plot(np.random.randn(50).cumsum( ), 'k- -' ) 



1.0 ■ 



0.8 ■ 



0.6 ■ 



0.4 ■ 



0.2 ■ 

0.0 - 



) H-i-1-1-1- 0.0 -1-1-i-1- 

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 



Figure 9-3. Data visualization after single plot 


The ' k- -' is a style option instructing matplotlib to plot a black dashed line. The 
objects returned by fig.add_subplot here are AxesSubplot objects, on which you 
can directly plot on the other empty subplots by calling each one’s instance method 
(see Figure 9-4): 
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In [21]: _ = axl.hist(np.random. randn(100) , bins=20, color='k', alpha=0.3) 

In [22]: ax2.scatter(np.arange(30) , np.arange(30) + 3 * np.random. randn(30)) 

30 
25 
20 
15 
10 
5 
0 

-2 -1 6 1 2 0 5 10 15 20 25 30 

10.0 

7.5 
5.0 

2.5 
0.0 
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Figure 9-4. Data visualization after additional plots 
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You can find a comprehensive catalog of plot types in the matplotlib documentation. 

Creating a figure with a grid of subplots is a very common task, so matplotlib 
includes a convenience method, pit.subplots, that creates a new figure and returns 
a NumPy array containing the created subplot objects: 

In [24]: fig, axes = plt.subplots(2, 3) 

In [25]: axes 

0ut[25] : 

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7fb626374048>, 
<matplotlib.axes._subplots.AxesSubplot object at 0x7fb62625db00>, 
<matplotlib.axes._subplots.AxesSubplot object at 0x7fb6262f6c88>] , 
[<matplotlib.axes._subplots.AxesSubplot object at 0x7fb6261a36a0>, 
<matplotlib.axes._subplots.AxesSubplot object at 0x7fb626181860>, 
<matplotlib.axes._subplots.AxesSubplot object at 0x7fb6260fd4e0>] ], dtype 

=object) 

This is very useful, as the axes array can be easily indexed like a two-dimensional 
array; for example, axes[0, 1], You can also indicate that subplots should have the 
same x- or y-axis using sharex and sharey, respectively. This is especially useful 
when you’re comparing data on the same scale; otherwise, matplotlib autoscales plot 
limits independently. See Table 9-1 for more on this method. 
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Table 9-1. pyplot.subplots options 


Argument Description 


nrows Number of rows of subplots 

ncols Number of columns of subplots 

sha rex All subplots should use the same x-axis ticks (adjusting the xlim will affect all subplots) 

sha rey All subplots should use the same y-axis ticks (adjusting the ylim will affect all subplots) 

subplot_kw Diet of keywords passed to add_subplot call used to create each subplot 
**f ig_kw Additional keywords to subplots are used when creating the figure, such as pit. subplots (2, 2, 

figslze=(8, 6)) 


Adjusting the spacing around subplots 

By default matplotlib leaves a certain amount of padding around the outside of the 
subplots and spacing between subplots. This spacing is all specified relative to the 
height and width of the plot, so that if you resize the plot either programmatically or 
manually using the GUI window, the plot will dynamically adjust itself. You can 
change the spacing using the subplots_adjust method on Figure objects, also avail¬ 
able as a top-level function: 

subplots_adjust(left=None, bottom=None, right=None, top=None, 
wspace=None, hspace=None) 

wspace and hspace controls the percent of the figure width and figure height, respec¬ 
tively, to use as spacing between subplots. Here is a small example where I shrink the 
spacing all the way to zero (see Figure 9-5): 

fig, axes = pit.subplots(2, 2, sharex=True, sharey=True) 
for l in range(2): 

for j in range(2): 

axes[i, j] .hist(np.random.randn(500), bins=50, color='k', alpha=0.5) 
pit.subplots_adjust(wspace=0, hspace=0) 
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Figure 9-5. Data visualization with no inter-subplot spacing 


You may notice that the axis labels overlap, matplotlib doesn’t check whether the 
labels overlap, so in a case like this you would need to fix the labels yourself by speci¬ 
fying explicit tick locations and tick labels (we’ll look at how to do this in the follow¬ 
ing sections). 

Colors, Markers, and Line Styles 

Matplotlib’s main plot function accepts arrays of x and y coordinates and optionally a 
string abbreviation indicating color and line style. For example, to plot x versus y 
with green dashes, you would execute: 

ax.plot(x, y, 'g--' ) 

This way of specifying both color and line style in a string is provided as a conve¬ 
nience; in practice if you were creating plots programmatically you might prefer not 
to have to munge strings together to create plots with the desired style. The same plot 
could also have been expressed more explicitly as: 

ax.plot(x, y, linestyle= '- -' , color='g') 

There are a number of color abbreviations provided for commonly used colors, but 
you can use any color on the spectrum by specifying its hex code (e.g., '#CECECE 1 ). 
You can see the full set of line styles by looking at the docstring for plot (use plot? in 
IPython or Jupyter). 
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Line plots can additionally have markers to highlight the actual data points. Since 
matplotlib creates a continuous line plot, interpolating between points, it can occa¬ 
sionally be unclear where the points lie. The marker can be part of the style string, 
which must have color followed by marker type and line style (see Figure 9-6): 

In [30]: from inport randn 


In [31]: pit,plot(randn(30).cumsum( ), 'ko--') 



Figure 9-6. Line plot with markers 


This could also have been written more explicitly as: 

plot(randn(30).cunsum( ), color='k', linestyle= 'dashed 1 , narker='o') 

For line plots, you will notice that subsequent points are linearly interpolated by 
default. This can be altered with the drawstyle option (Figure 9-7): 

In [33]: data = np.random.randn(30).cumsum( ) 

In [34]: plt.plot(data, 'k- -' , label= 'Default 1 ) 

0ut[34]: [<matplotlib.lines.Llne2D at 0x7fb624d86160>] 

In [35]: plt.plot(data, 'k-' , drawstyle=' steps-post' , label='steps-post' ) 

0ut[35]: [<matplotlib.lines.Line2D at 0x7fb624d869e8>] 

In [36]: pit.legend(loc= 'best' ) 
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Figure 9-7. Line plot with different drawstyle options 

You may notice output like <matplotlib. lines. Line2D at ... > when you run this, 
matplotlib returns objects that reference the plot subcomponent that was just added. 
A lot of the time you can safely ignore this output. Here, since we passed the label 
arguments to plot, we are able to create a plot legend to identify each line using 
pit. legend. 

You must call pit. legend (or ax. legend, if you have a reference to 
the axes) to create the legend, whether or not you passed the label 
options when plotting the data. 


Ticks, Labels, and Legends 

For most kinds of plot decorations, there are two main ways to do things: using the 
procedural pyplot interface (i.e., matplotlib.pyplot) and the more object-oriented 
native matplotlib API. 

The pyplot interface, designed for interactive use, consists of methods like xltm, 
xticks, and xticklabels. These control the plot range, tick locations, and tick labels, 
respectively. They can be used in two ways: 
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• Called with no arguments returns the current parameter value (e.g., plt.xlim() 
returns the current x-axis plotting range) 

• Called with parameters sets the parameter value (e.g., plt.xlim([0, 10]), sets 
the x-axis range to 0 to 10) 

All such methods act on the active or most recently created AxesSubplot. Each of 
them corresponds to two methods on the subplot object itself; in the case of xlim 
these are ax. get_xlim and ax. set_xlim. I prefer to use the subplot instance methods 
myself in the interest of being explicit (and especially when working with multiple 
subplots), but you can certainly use whichever you find more convenient. 

Setting the title, axis labels, ticks, and ticklabels 

To illustrate customizing the axes, I’ll create a simple figure and plot of a random 
walk (see Figure 9-8): 


In [37]: ftg = plt.figureQ 

In [38]: ax = fig.add_subplot(l, 1, 1) 

In [39]: ax.plot(np.random. randn(1000) .cumsum( )) 



To change the x-axis ticks, it’s easiest to use set_xticks and set_xticklabels. The 
former instructs matplotlib where to place the ticks along the data range; by default 
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these locations will also be the labels. But we can set any other values as the labels 
using set_xticklabels: 

In [40]: ticks = ax.set_xticks( [0, 250, 500, 750, 1000]) 

In [41]: labels = ax.set_xticklabels( [ 'one' , 'two', 'three', 'four', 'five'], 
....: rotation=30, fontsize= 'snail' ) 

The rotation option sets the x tick labels at a 30-degree rotation. Lastly, set_xlabel 
gives a name to the x-axis and set_title the subplot title (see Figure 9-9 for the 
resulting figure): 

In [42]: ax.set_title( 'My first matplotlib plot') 

0ut[42]: <natplotlib.text.Text at 0x7fb624d055f8> 



Figure 9-9. Simple plot for illustrating xticks 


Modifying the y-axis consists of the same process, substituting y for x in the above. 
The axes class has a set method that allows batch setting of plot properties. From the 
prior example, we could also have written: 

props = { 

'title': 'My first natplotlib plot', 

'xlabel': 'Stages' 

} 

ax.set(**props) 
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Adding legends 

Legends are another critical element for identifying plot elements. There are a couple 
of ways to add one. The easiest is to pass the label argument when adding each piece 
of the plot: 

In [44]: from import randn 

In [45]: fig = plt.figureQ; ax = fig.add_subplot(l, 1 , 1) 

In [46]: ax.plot(randn(1000).cumsum( ), 'k', label='one') 

0ut[46]: [<matplotlib.lines.Line2D at 0x7fb624bdf860>] 

In [47]: ax.plot(randn(1000).cumsum( ), ’k-- 1 , label='two') 

0ut[47]: [<matplotlib.lines.Line2D at 0x7fb624be90f0>] 

In [48]: ax.plot(randn(1000).cumsum( ), 'k.' , label= 'three' ) 

0ut[48]: [<matplotlib.lines.Line2D at 0x7fb624be9160>] 


Once you’ve done this, you can either call ax.legend() or pit.legend() to automat¬ 
ically create a legend. The resulting plot is in Figure 9-10: 

In [49]: ax.legend(loc= 'best 1 ) 



The legend method has several other choices for the location loc argument. See the 
docstring (with ax.legend?) for more information. 
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The loc tells matplotlib where to place the plot. If you aren’t picky, ' best 1 is a good 
option, as it will choose a location that is most out of the way. To exclude one or more 
elements from the legend, pass no label or label=' _nolegend_ 1 . 

Annotations and Drawing on a Subplot 

In addition to the standard plot types, you may wish to draw your own plot annota¬ 
tions, which could consist of text, arrows, or other shapes. You can add annotations 
and text using the text, arrow, and annotate functions, text draws text at given 
coordinates (x, y) on the plot with optional custom styling: 

ax.text(x, y, 'Hetlo wortd!', 

family=' monospace' , fontsize=10) 

Annotations can draw both text and arrows arranged appropriately. As an example, 
let’s plot the closing S&P 500 index price since 2007 (obtained from Yahoo! Finance) 
and annotate it with some of the important dates from the 2008-2009 financial crisis. 
You can most easily reproduce this code example in a single cell in a Jupyter note¬ 
book. See Figure 9-11 for the result: 

from import datetime 

fig = plt.figureQ 

ax = fig.add_subplot(l, 1, 1) 

data = pd.read_csv( 'examptes/spx.csv' , index_col=0, parse_dates=True) 
spx = data[ 1 SPX' ] 

spx.plot(ax=ax, style='k-') 

crisis_data = [ 

(datetime(2007, 10, 11), 'Peak of bull market'), 

(datetime(2008, 3, 12), 'Bear Stearns Fails'), 

(datetime(2008, 9, 15), 'Lehman Bankruptcy') 

] 

for date, label in crisis_data: 

ax.annotate(label, xy=(date, spx.asof(date) + 75), 
xytext=(date, spx.asof(date) + 225), 

arrowprops=dict(facecolor= 'black' , headwidth=4, width=2, 
headlength=4) , 

horizontalalignment=' left' , verticalalignment= 'top' ) 

# Zoom in on 2007-2010 

ax . set_xlim( [ 1 1/1/2007' , ' 1/1/2011' ]) 

ax.set_ylim( [600, 1800]) 

ax.set_title(' Important dates in the 2008-2009 financial crisis') 
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Figure 9-11. Important dates in the 2008-2009financial crisis 


There are a couple of important points to highlight in this plot: the ax.annotate 
method can draw labels at the indicated x and y coordinates. We use the set_xlim 
and set_ylim methods to manually set the start and end boundaries for the plot 
rather than using matplotlib’s default. Lastly, ax.set_title adds a main title to the 
plot. 

See the online matplotlib gallery for many more annotation examples to learn from. 

Drawing shapes requires some more care, matplotlib has objects that represent many 
common shapes, referred to as patches. Some of these, like Rectangle and Circle, are 
found in matplotlib. pyplot, but the full set is located in matplotlib. patches. 

To add a shape to a plot, you create the patch object shp and add it to a subplot by 
calling ax.add_patch(shp) (see Figure 9-12): 

fig = ptt.flgure() 

ax = fig.add_subptot(l, 1, 1) 

rect = ptt.Rectangle((0.2, 0.75), 0.4, 0.15, color='k', alpha=0.3) 
clrc = ptt.Circle((0.7, 0.2), 0.15, cotor='b', alpha=0.3) 
pgon = plt.Polygon([[0.15, 0.15], [0.35, 0.4], [0.2, 0.6]], 
color='g', atpha=0.5) 


ax.add_patch(rect) 
ax.add_patch(circ) 
ax.add_patch(pgon) 
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Figure 9-12. Data visualization composed from three different patches 

If you look at the implementation of many familiar plot types, you will see that they 
are assembled from patches. 

Saving Plots to File 

You can save the active figure to file using pit. savefig. This method is equivalent to 
the figure object’s savefig instance method. For example, to save an SVG version of a 
figure, you need only type: 

pit.savefig ( 1 figpath.svg' ) 

The file type is inferred from the file extension. So if you used .pdf instead, you 
would get a PDF. There are a couple of important options that I use frequently for 
publishing graphics: dpt, which controls the dots-per-inch resolution, and 
bbox_inches, which can trim the whitespace around the actual figure. To get the 
same plot as a PNG with minimal whitespace around the plot and at 400 DPI, you 
would do: 

pit.savefig(' figpath.png' , dpi=400, bbox_inches= 'tight' ) 

savefig doesn’t have to write to disk; it can also write to any file-like object, such as a 
BytesIO: 

from import BytesIO 
buffer = BytesIOQ 
pit.savefig ( buffer ) 
plot_data = buffer. getvalue() 

See Table 9-2 for a list of some other options for savefig. 
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Table 9-2. Figure.savefig options 


Argument Description 


fname 

dpi 

facecolor, 
edgecolor 
format 
bbox Inches 


String containing a filepath or a Python file-like object. The figure format is inferred from the file 
extension (e.g., . pdf for PDF or. png for PNG) 

The figure resolution in dots per inch; defaults to 100 out of the box but can be configured 
The color of the figure background outside of the subplots; 1 w' (white), by default 

The explicit file format to use (' png', 'pdf', ' svg', ' ps', ' eps',...) 

The portion of the figure to save; if' tight' is passed, will attempt to trim the empty space around 
the figure 


matplotlib Configuration 

matplotlib comes configured with color schemes and defaults that are geared primar¬ 
ily toward preparing figures for publication. Fortunately, nearly all of the default 
behavior can be customized via an extensive set of global parameters governing figure 
size, subplot spacing, colors, font sizes, grid styles, and so on. One way to modify the 
configuration programmatically from Python is to use the rc method; for example, to 
set the global default figure size to be 10 x 10, you could enter: 

pit.rc( 'figure' , flgslze=(10, 10)) 

The first argument to rc is the component you wish to customize, such as ' figure', 
'axes', 'xtick', 'ytick', 'grid', ' legend ', or many others. After that can follow a 
sequence of keyword arguments indicating the new parameters. An easy way to write 
down the options in your program is as a diet: 

font_options = {'family' : 'monospace', 

'weight' : 'bold', 

'size' : 'small'} 
pit.rc( 'font' , **font_options) 

For more extensive customization and to see a list of all the options, matplotlib comes 
with a configuration file matplotlibrc in the matplotlib/mpl-data directory. If you cus¬ 
tomize this file and place it in your home directory titled .matplotlibrc, it will be 
loaded each time you use matplotlib. 

As we’ll see in the next section, the seaborn package has several built-in plot themes 
or styles that use matplotlib’s configuration system internally. 

9.2 Plotting with pandas and seaborn 

matplotlib can be a fairly low-level tool. You assemble a plot from its base compo¬ 
nents: the data display (i.e., the type of plot: line, bar, box, scatter, contour, etc.), leg¬ 
end, title, tick labels, and other annotations. 
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In pandas we may have multiple columns of data, along with row and column labels, 
pandas itself has built-in methods that simplify creating visualizations from Data- 
Frame and Series objects. Another library is seaborn, a statistical graphics library cre¬ 
ated by Michael Waskom. Seaborn simplifies creating many common visualization 
types. 

Importing seaborn modifies the default matplotlib color schemes 
and plot styles to improve readability and aesthetics. Even if you do 
not use the seaborn API, you may prefer to import seaborn as a 
simple way to improve the visual aesthetics of general matplotlib 
plots. 

Line Plots 

Series and DataFrame each have a plot attribute for making some basic plot types. By 
default, plot() makes line plots (see Figure 9-13): 

In [60]: s = pd.Series(np.random.randn(10).cumsum( ), index=np.arange(0, 100, 10)) 




Figure 9-13. Simple Series plot 


The Series object’s index is passed to matplotlib for plotting on the x-axis, though you 
can disable this by passing use_index=False. The x-axis ticks and limits can be 
adjusted with the xtlcks and xlim options, and y-axis respectively with yticks and 
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ylim. See Table 9-3 for a full listing of plot options. I’ll comment on a few more of 
them throughout this section and leave the rest to you to explore. 

Most of pandas’s plotting methods accept an optional ax parameter, which can be a 
matplotlib subplot object. This gives you more flexible placement of subplots in a grid 
layout. 

DataFrame’s plot method plots each of its columns as a different line on the same 
subplot, creating a legend automatically (see Figure 9-14): 

In [62]: df = pd.DataFrane(np.random.randn(10, 4) .cumsum(0), 

....: columns=[ 'A' , 'B', 'C', 'D ' ], 

....: tndex=np.arange(0, 100, 10)) 


In [63]: df.plotQ 



The plot attribute contains a “family” of methods for different plot types. For exam¬ 
ple, df .plotQ is equivalent to df. plot .lineQ. We’ll explore some of these methods 
next. 



Additional keyword arguments to plot are passed through to the 
respective matplotlib plotting function, so you can further custom¬ 
ize these plots by learning more about the matplotlib API. 
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Table 9-3. Series.plot method arguments 


1 Argument 

Description II 

label 

Label for plot legend 

ax 

matplotlib subplot object to plot on; if nothing passed, uses active matplotlib subplot 

style 

Style string, like 1 ko- -to be passed to matplotlib 

alpha 

The plot fill opacity (from 0 to 1) 

kind 

Can be 'area', 'bar', 'barh', 'density 1 , 'hist', 'kde', 'line', 'pie' 

logy 

Use logarithmic scaling on the y-axis 

use_tndex 

Use the object index for tick labels 

rot 

Rotation of tick labels (0 through 360) 

xticks 

Values to use for x-axis ticks 

ytlcks 

Values to use for y-axis ticks 

xlin 

x-axis limits (e.g., [0, 10]) 

ylin 

y-axis limits 

grid 

Display axis grid (on by default) 


DataFrame has a number of options allowing some flexibility with how the columns 
are handled; for example, whether to plot them all on the same subplot or to create 
separate subplots. See Table 9-4 for more on these. 


Table 9-4. DataFrame-specific plot arguments 


1 Argument 

Description 1 

subplots 

Plot each DataFrame column in a separate subplot 

sharex 

If subplots=True, share the same x-axis, linking ticks and limits 

sharey 

If subplots=True, share the same y-axis 

figsize 

Size of figure to create as tuple 

title 

Plot title as string 

legend 

Add a subplot legend (True by default) 

sort_columns 

Plot columns in alphabetical order; by default uses existing column order 



For time series plotting, see Chapter 11. 
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Bar Plots 


The plot.barQ and plot.barhQ make vertical and horizontal bar plots, respec¬ 
tively. In this case, the Series or DataFrame index will be used as the x (bar) or y 
(barh) ticks (see Figure 9-15): 

In [64]: fig, axes = plt.subplots(2, 1) 

In [65]: data = pd.Series(np.random. rand(16) , index=tist( 'abcdefghijklmnop' )) 

In [66]: data.plot. bar(ax=axes[0] , color='k', alpha=0.7) 

0ut[66]: <matplotlib.axes._subplots.AxesSubplot at 0x7fb62493d470> 


In [67]: data.plot.barh(ax=axes[l] , color='k', alpha=0.7) 



Figure 9-15. Horizonal and vertical bar plot 


The options color= ' k 1 and alpha=0.7 set the color of the plots to black and use par¬ 
tial transparency on the filling. 
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With a DataFrame, bar plots group the values in each row together in a group in bars, 
side by side, for each value. See Figure 9-16: 

In [69]: df = pd.DataFrane(np.random.rand(6, 4), 

....: tndex=[' one 1 , 'two', 'three', 'four 1 , 'five', 'six'], 

....: columns=pd.Index( [' A' , 'B 1 , 'C, 'D'], name= ' Genus' )) 


In [70] 

: df 



Out[70] 




Genus 

A 

B 

C 

one 

0.370670 

0.602792 

0.229159 

two 

0.420082 

0.571653 

0.049024 

three 

0.814568 

0.277160 

0.880316 

four 

0.374020 

0.899420 

0.460304 

five 

0.433270 

0.125107 

0.494675 

six 

0.601648 

0.478576 

0.205690 


D 

0.486744 

0.880592 

0.431326 

0.100843 

0.961825 

0.560547 


In [71]: df.plot.bar( ) 


1.0 -i- 

Genus 



Figure 9-16. DataFrame bar plot 


Note that the name “Genus” on the DataFrame’s columns is used to title the legend. 
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We create stacked bar plots from a DataFrame by passing stacked=True, resulting in 
the value in each row being stacked together (see Figure 9-17): 


In [73]: df.plot.barh(stacked=True, alpha=0.5) 



Figure 9-17. DataFrame stacked bar plot 



A useful recipe for bar plots is to visualize a Series’s value frequency 
using value counts: s. valuecountsQ .plot. bar(). 


Returning to the tipping dataset used earlier in the book, suppose we wanted to make 
a stacked bar plot showing the percentage of data points for each party size on each 
day. I load the data using read_csv and make a cross-tabulation by day and party size: 


In [75]: 

tips = pd.read. 

_csv( 1 exanples/tips.csv' ) 

In [76]: 

party_counts = 

pd.crosstab(tips [' day' ], 

In [77]: 
0ut[77] : 

party_counts 


size 1 
day 

2 3 4 5 

6 

Frl 

1 

16 1 1 0 

0 

Sat 

2 

53 18 13 1 

0 

Sun 

0 

39 15 18 3 

1 

Thur 1 

48 4 5 1 

3 
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# Not many 1- and 6-person parties 

In [78]: party_counts = party_counts. loc[:, 2:5] 

Then, normalize so that each row sums to 1 and make the plot (see Figure 9-18): 

# Normalize to sum to 1 

In [79]: party_pcts = party_counts.div(party_counts.sum(l), axis=0) 


In [80]: party_pcts 
Out[80] : 


size 

2 

3 

4 

5 

day 

Fri 

0.888889 

0.055556 

0.055556 

0.000000 

Sat 

0.623529 

0.211765 

0.152941 

0.011765 

Sun 

0.520000 

0.200000 

0.240000 

0.040000 

Thur 

0.827586 

0.068966 

0.086207 

0.017241 


In [81]: party_pcts.plot.bar() 



day 


Figure 9-18. Fraction of parties by size on each day 


So you can see that party sizes appear to increase on the weekend in this dataset. 

With data that requires aggregation or summarization before making a plot, using the 
seaborn package can make things much simpler. Let’s look now at the tipping per¬ 
centage by day with seaborn (see Figure 9-19 for the resulting plot): 
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In [83]: import as 


In [84]: tips[ 'tip_pct 1 ] = ttps['tip'] / (tips[' total_bill' ] - tips['tip']) 


In [85]: tips.head() 
0ut[85] : 



totat_bill 

tip smoker 

day 

time 

size 

tip_pct 

0 

16.99 

1.01 

No 

Sun 

Dinner 

2 

0.063204 

1 

10.34 

1.66 

No 

Sun 

Dinner 

3 

0.191244 

2 

21.01 

3.50 

No 

Sun 

Dinner 

3 

0.199886 

3 

23.68 

3.31 

No 

Sun 

Dinner 

2 

0.162494 

4 

24.59 

3.61 

No 

Sun 

Dinner 

4 

0.172069 

In 

[86]: sns.l 

narplot(x= 

'tip. 

.pet' 

, y='day' 

, data=tips, orient; 


Sun 


Sat 


Thur 


Fri 


0.00 0.05 0.10 0.15 0.20 0.25 0.30 

tip_pct 


Figure 9-19. Tipping percentage by day with error bars 



Plotting functions in seaborn take a data argument, which can be a pandas Data- 
Frame. The other arguments refer to column names. Because there are multiple 
observations for each value in the day, the bars are the average value of tip_pct. The 
black lines drawn on the bars represent the 95% confidence interval (this can be con¬ 
figured through optional arguments). 
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seaborn. barplot has a hue option that enables us to split by an additional categorical 
value (Figure 9-20): 

In [88]: sns.barptot(x= 'tip_pct' , y='day', hue=' tine 1 , data=tips, orient='h') 



Figure 9-20. Tipping percentage by day and time 

Notice that seaborn has automatically changed the aesthetics of plots: the default 
color palette, plot background, and grid line colors. You can switch between different 
plot appearances using seaborn. set: 

In [90]: sns. set(style="whitegrLd" ) 

Histograms and Density Plots 

A histogram is a kind of bar plot that gives a discretized display of value frequency. 
The data points are split into discrete, evenly spaced bins, and the number of data 
points in each bin is plotted. Using the tipping data from before, we can make a histo¬ 
gram of tip percentages of the total bill using the plot, hist method on the Series 
(see Figure 9-21): 

In [92]: tips[' tip_pct' ]. plot.hist(bins=50) 
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Figure 9-21. Histogram of tip percentages 


A related plot type is a density plot, which is formed by computing an estimate of a 
continuous probability distribution that might have generated the observed data. The 
usual procedure is to approximate this distribution as a mixture of “kernels”—that is, 
simpler distributions like the normal distribution. Thus, density plots are also known 
as kernel density estimate (KDE) plots. Using plot. kde makes a density plot using 
the conventional mixture-of-normals estimate (see Figure 9-22): 

In [94]: tips [' tip_pct ']. plot.denslty( ) 
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Figure 9-22. Density plot of tip percentages 


Seaborn makes histograms and density plots even easier through its distplot 
method, which can plot both a histogram and a continuous density estimate simulta¬ 
neously. As an example, consider a bimodal distribution consisting of draws from 
two different standard normal distributions (see Figure 9-23): 

In [96]: compl = np.random.normal(0, 1, stze=200) 

In [97]: comp2 = np.random.normal(10, 2, size=200) 

In [98]: values = pd.Sertes(np.concatenate([compl, comp2])) 

In [99]: sns.dtstptot(vaIues, bins=100, color='k') 
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Figure 9-23. Normalized histogram of normal mixture with density estimate 


Scatter or Point Plots 

Point plots or scatter plots can be a useful way of examining the relationship between 
two one-dimensional data series. For example, here we load the macrodata dataset 
from the statsmodels project, select a few variables, then compute log differences: 

In [100]: macro = pd.read_csv( 'examples/macrodata.csv' ) 

In [101]: data = macro[ [ 1 cpt 1 , 'ml', 'tbitrate', 'unemp']] 

In [102]: trans_data = np.log(data).diff().dropna() 


In [103]: trans_data[-5: ] 
Out[103] : 



cpi 

ml 

tbtlrate 

unemp 

198 

-0.007904 

0.045361 -0.396881 

0.105361 

199 

-0.021979 

0.066753 -2.277267 

0.139762 

200 

0.002340 

0.010286 

0.606136 

0.160343 

201 

0.008419 

0.037461 -0.200671 

0.127339 

202 

0.008894 

0.012202 -0.405465 

0.042560 
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We can then use seaborn’s regplot method, which makes a scatter plot and fits a lin¬ 
ear regression line (see Figure 9-24): 

In [105]: sns.regptot( 'ml 1 , 'unemp', data=trans_data) 

Out[105]: <matplotlib.axes._subplots.AxesSubplot at 0x7fb613720be0> 


In [106]: plt.title( 'Changes in log %s versus log %s' % ('ml', 'unemp')) 



Figure 9-24. A seaborn regression/scatter plot 


In exploratory data analysis it’s helpful to be able to look at all the scatter plots among 
a group of variables; this is known as a pairs plot or scatter plot matrix. Making such a 
plot from scratch is a bit of work, so seaborn has a convenient pairplot function, 
which supports placing histograms or density estimates of each variable along the 
diagonal (see Figure 9-25 for the resulting plot): 

In [107]: sns.pairplot(trans_data, diag_kind= 'kde' , plot_kws={ 'alpha' : 0.2}) 
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Figure 9-25. Pair plot matrix of statsmodels macro data 


You may notice the plot_kws argument. This enables us to pass down configuration 
options to the individual plotting calls on the off-diagonal elements. Check out the 
seaborn. pairplot docstring for more granular configuration options. 
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Facet Grids and Categorical Data 

What about datasets where we have additional grouping dimensions? One way to vis¬ 
ualize data with many categorical variables is to use a facet grid. Seaborn has a useful 
built-in function factorplot that simplifies making many kinds of faceted plots (see 
Figure 9-26 for the resulting plot): 


In [108]: sns.factorplot(x= 'day' , y='tip_pct', hue='time', col=' smoker' , 
.: kind='bar', data=tlps[tips.tip_pct < 1]) 



Figure 9-26. Tipping percentage by day/time/smoker 


Instead of grouping by 'tine' by different bar colors within a facet, we can also 
expand the facet grid by adding one row per tine value (Figure 9-27): 

In [109]: sns.factorplot(x= 'day' , y='tip_pct', row='tlme', 

.: col='smoker', 

.: kind='bar', data=tips[tips.tip_pct < 1]) 
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Figure 9-27. tip_pct by day; facet by time/smoker 

factorplot supports other plot types that may be useful depending on what you are 
trying to display. For example, box plots (which show the median, quartiles, and out¬ 
liers) can be an effective visualization type (Figure 9-28): 

In [110]: sns.factorplot(x= 'tip_pct 1 , y='day', ktnd='box', 

.: data=tips[tips.tlp_pct < 0.5]) 
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Figure 9-28. Box plot oftip_pct by day 

You can create your own facet grid plots using the more general seaborn. FacetGrid 
class. See the seaborn documentation for more. 

9.3 Other Python Visualization Tools 

As is common with open source, there are a plethora of options for creating graphics 
in Python (too many to list). Since 2010, much development effort has been focused 
on creating interactive graphics for publication on the web. With tools like Bokeh and 
Plotly, it’s now possible to specify dynamic, interactive graphics in Python that are 
destined for a web browser. 

For creating static graphics for print or web, I recommend defaulting to matplotlib 
and add-on libraries like pandas and seaborn for your needs. For other data visualiza¬ 
tion requirements, it may be useful to learn one of the other available tools out there. 
I encourage you to explore the ecosystem as it continues to involve and innovate into 
the future. 



9.3 Other Python Visualization Tools | 285 











9.4 Conclusion 

The goal of this chapter was to get your feet wet with some basic data visualization 
using pandas, matplotlib, and seaborn. If visually communicating the results of data 
analysis is important in your work, I encourage you to seek out resources to learn 
more about effective data visualization. It is an active field of research and you can 
practice with many excellent learning resources available online and in print form. 

In the next chapter, we turn our attention to data aggregation and group operations 
with pandas. 
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CHAPTER 10 


Data Aggregation and Group Operations 


Categorizing a dataset and applying a function to each group, whether an aggregation 
or transformation, is often a critical component of a data analysis workflow. After 
loading, merging, and preparing a dataset, you may need to compute group statistics 
or possibly pivot tables for reporting or visualization purposes, pandas provides a 
flexible groupby interface, enabling you to slice, dice, and summarize datasets in a 
natural way. 

One reason for the popularity of relational databases and SQL (which stands for 
“structured query language”) is the ease with which data can be joined, filtered, trans¬ 
formed, and aggregated. However, query languages like SQL are somewhat con¬ 
strained in the kinds of group operations that can be performed. As you will see, with 
the expressiveness of Python and pandas, we can perform quite complex group oper¬ 
ations by utilizing any function that accepts a pandas object or NumPy array. In this 
chapter, you will learn how to: 

• Split a pandas object into pieces using one or more keys (in the form of func¬ 
tions, arrays, or DataFrame column names) 

• Calculate group summary statistics, like count, mean, or standard deviation, or a 
user-defined function 

• Apply within-group transformations or other manipulations, like normalization, 
linear regression, rank, or subset selection 

• Compute pivot tables and cross-tabulations 

• Perform quantile analysis and other statistical group analyses 
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Aggregation of time series data, a special use case of groupby, is 
referred to as resampling in this book and will receive separate 
treatment in Chapter 11 . 


10.1 GroupBy Mechanics 

Hadley Wickham, an author of many popular packages for the R programming lan¬ 
guage, coined the term split-apply-combine for describing group operations. In the 
first stage of the process, data contained in a pandas object, whether a Series, Data- 
Frame, or otherwise, is split into groups based on one or more keys that you provide. 
The splitting is performed on a particular axis of an object. For example, a DataFrame 
can be grouped on its rows (axis=0) or its columns (axis=l). Once this is done, a 
function is applied to each group, producing a new value. Finally, the results of all 
those function applications are combined into a result object. The form of the result¬ 
ing object will usually depend on what’s being done to the data. See Figure 10-1 for a 
mockup of a simple group aggregation. 



Figure 10-1. Illustration of a group aggregation 


Each grouping key can take many forms, and the keys do not have to be all of the 
same type: 

• A list or array of values that is the same length as the axis being grouped 

• A value indicating a column name in a DataFrame 
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A diet or Series giving a correspondence between the values on the axis being 
grouped and the group names 

A function to be invoked on the axis index or the individual labels in the index 


Note that the latter three methods are shortcuts for producing an array of values to be 
used to split up the object. Don’t worry if this all seems abstract. Throughout this 
chapter, I will give many examples of all these methods. To get started, here is a small 
tabular dataset as a DataFrame: 


In [10]: df = pd.DataFrame({' keyl' 

. . .. : 1 key2 1 
....: 1 datal' 
....: 1 data2' 


1 'a' , a 1 , ' b' , ' b' , 'a 1 ], 

['one', 'two', 'one', 'two', 'one'], 
np.random.randn(5) , 
np.random.randn(5)}) 


In [11]: df 
Out[ll] : 

datal 
0 -0.204708 

1 0.478943 

2 -0.519439 

3 -0.555730 

4 1.965781 


data2 keyl key2 
1.393406 a one 
0.092908 a two 
0.281746 b one 
0.769023 b two 
1.246435 a one 


Suppose you wanted to compute the mean of the datal column using the labels from 
keyl. There are a number of ways to do this. One is to access datal and call groupby 
with the column (a Series) at keyl: 


In [12]: grouped = df[ 'datal '] groupby(df[ 'keyl ']) 


In [13]: grouped 

0ut[13]: <pandas.core.groupby.SeriesGroupBy object at 0x7faa31537390> 

This grouped variable is now a GroupBy object. It has not actually computed anything 
yet except for some intermediate data about the group key df [' keyl 1 ]. The idea is 
that this object has all of the information needed to then apply some operation to 
each of the groups. For example, to compute group means we can call the GroupBy’s 
mean method: 


In [14]: grouped.mean( ) 

0ut[14] : 

keyl 

a 0.746672 
b -0.537585 

Name: datal, dtype: float64 


Later, I’ll explain more about what happens when you call .mean(). The important 
thing here is that the data (a Series) has been aggregated according to the group key, 
producing a new Series that is now indexed by the unique values in the keyl column. 
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The result index has the name 'keyl' because the DataFrame column df [' keyl ' ] 
did. 

If instead we had passed multiple arrays as a list, wed get something different: 

In [15]: means = df[' datal'] ,groupby( [df[ 'keyl'] , df[' key2' ]]) .mean() 

In [16]: means 
0ut[16] : 
keyl key2 

a one 0.880536 

two 0.478943 

b one -0.519439 

two -0.555730 

Name: datal, dtype: float64 

Here we grouped the data using two keys, and the resulting Series now has a hier¬ 
archical index consisting of the unique pairs of keys observed: 

In [17]: means.unstackQ 
0ut[17] : 

key2 one two 

keyl 

a 0.880536 0.478943 

b -0.519439 -0.555730 

In this example, the group keys are all Series, though they could be any arrays of the 
right length: 

In [18]: states = np.array( [ 'Ohio' , 'California', 'California', 'Ohio 1 , 'Ohio']) 

In [19]: years = np.array([2005, 2005, 2006, 2005, 2006]) 

In [20]: df['datal'] groupby([states, years]) .mean() 

Out[20] : 


California 

2005 

0.478943 


2006 

-0.519439 

Ohio 

2005 

-0.380219 


2006 

1.965781 


Name: datal, dtype: float64 

Frequently the grouping information is found in the same DataFrame as the data you 
want to work on. In that case, you can pass column names (whether those are strings, 
numbers, or other Python objects) as the group keys: 

In [21]: df.groupby(' keyl ') .mean( ) 

0ut[21] : 

datal data2 

keyl 

a 0.746672 0.910916 

b -0.537585 0.525384 

In [22]: df.groupby( [ 'keyl' , 'key2' ]) ,mean( ) 
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0ut[22] : 


keyl 

key2 

datal 

data2 

a 

one 

0.880536 

1.319920 


two 

0.478943 

0.092908 

b 

one 

-0.519439 

0.281746 


two 

-0.555730 

0.769023 


You may have noticed in the first case df .groupby(' keyl 1 ) .mean() that there is no 
key2 column in the result. Because df [' key2 ' ] is not numeric data, it is said to be a 
nuisance column, which is therefore excluded from the result. By default, all of the 
numeric columns are aggregated, though it is possible to filter down to a subset, as 
you’ll see soon. 

Regardless of the objective in using groupby, a generally useful GroupBy method is 
size, which returns a Series containing group sizes: 

In [23]: df.groupby( [ 1 keyl' , 'key2' ]). size( ) 

0ut[23] : 
keyl key2 
a one 2 

two 1 

b one 1 

two 1 

dtype: tnt64 

Take note that any missing values in a group key will be excluded from the result. 


Iterating Over Groups 

The GroupBy object supports iteration, generating a sequence of 2-tuples containing 
the group name along with the chunk of data. Consider the following: 


In [24] 


for name, group in df.groupby( 'keyl' ): 
print(name) 
print(group) 


datal 
0 -0.204708 

1 0.478943 

4 1.965781 

b 

datal 

2 -0.519439 

3 -0.555730 


data2 keyl key2 
1.393406 a one 
0.092908 a two 
1.246435 a one 

data2 keyl key2 
0.281746 b one 
0.769023 b two 


In the case of multiple keys, the first element in the tuple will be a tuple of key values: 


In [25]: for (kl, k2), group In df,groupby( [ 1 keyl' , 'key2']): 
....: print((kl, k2)) 

....: print(group) 
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( ' a' , ' one' ) 

datal data2 keyl key2 

0 -0.204708 1.393406 a one 

4 1.965781 1.246435 a one 

( 1 a' , ' two' ) 

datal data2 keyl key2 

1 0.478943 0.092908 a two 

( ' b' , ' one' ) 

datal data2 keyl key2 

2 -0.519439 0.281746 b one 

( ' b' , ' two' ) 

datal data2 keyl key2 

3 -0.55573 0.769023 b two 

Of course, you can choose to do whatever you want with the pieces of data. A recipe 
you may find useful is computing a diet of the data pieces as a one-liner: 

In [26]: pieces = dict(list(df ,groupby( 'keyl' ))) 

In [27]: pieces['b'] 

0ut[27] : 

datal data2 keyl key2 

2 -0.519439 0.281746 b one 

3 -0.555730 0.769023 b two 

By default groupby groups on axis=0, but you can group on any of the other axes. 
For example, we could group the columns of our example df here by dtype like so: 

In [28]: df.dtypes 
0ut[28] : 

datal ftoat64 

data2 ftoat64 

keyl object 

key2 object 

dtype: object 


In [29]: grouped = df.groupby(df.dtypes, axts=l) 

We can print out the groups like so: 


In [30]: for 

dtype, group 


print(dtype) 


print(group) 

float64 

datal 

data2 

0 -0.204708 

1.393406 

1 0.478943 

0.092908 

2 -0.519439 

0.281746 

3 -0.555730 

0.769023 

4 1.965781 

1.246435 


object 
keyl key2 
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0 a one 

1 a two 

2 b one 

3 b two 

4 a one 


Selecting a Column or Subset of Columns 

Indexing a GroupBy object created from a DataFrame with a column name or array 
of column names has the effect of column subsetting for aggregation. This means 
that: 


df,groupby(’ keyl' )[' datal'] 
df,groupby(' keyl' )[[’ data2 1 ]] 

are syntactic sugar for: 

df [ 'datal' ]. groupby(df [ 1 keyl' ]) 
df [[' data2' ]] groupby(df [ 'keyl' ]) 

Especially for large datasets, it may be desirable to aggregate only a few columns. For 
example, in the preceding dataset, to compute means for just the data2 column and 
get the result as a DataFrame, we could write: 


In [31]: df.groupby( [ 1 keyl' , 
0ut[31] : 

data2 


keyl key2 

a one 1.319920 

two 0.092908 
b one 0.281746 

two 0.769023 


’key2 '])[[' data2' ]] .mean() 


The object returned by this indexing operation is a grouped DataFrame if a list or 
array is passed or a grouped Series if only a single column name is passed as a scalar: 


In [32]: s_grouped = df,groupby( [' keyl' , 'key2' ])[ 'data2' ] 


In [33]: s_grouped 

0ut[33]: <pandas.core.groupby.SeriesGroupBy object at 0x7faa30c78da0> 


In [34]: s_grouped.mean() 
0ut[34] : 
keyl key2 

a one 1.319920 

two 0.092908 

b one 0.281746 

two 0.769023 

Name: data2, dtype: float64 
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Grouping with Diets and Series 

Grouping information may exist in a form other than an array. Let’s consider another 
example DataFrame: 

In [35]: people = pd.DataFrame(np.random.randn(5, 5), 

....: columns=[ 1 a 1 , ' b' , ’c’, ' d' , 'e' ], 

....: tndex=[' Joe 1 , 'Steve', 'Wes', 'Jin', 'Travis']) 

In [36]: people. iloc[2:3, [1, 2]] = np.nan # Add a few NA values 

In [37]: people 
0ut[37] : 

a b c d e 

Joe 1.007189 -1.296221 0.274992 0.228913 1.352917 

Steve 0.886429 -2.001637 -0.371843 1.669025 -0.438570 

Wes -0.539741 NaN NaN -1.021228 -0.577087 

Jim 0.124121 0.302614 0.523772 0.000940 1.343810 

Travis -0.713544 -0.831154 -2.370232 -1.860761 -0.860757 

Now, suppose I have a group correspondence for the columns and want to sum 
together the columns by group: 

In [38]: mapping = {'a': 'red', ' b' : 'red', 'c': 'blue', 

....: 'd 1 : 'blue', 'e 1 : ’red’, ’f' : 'orange'} 

Now, you could construct an array from this diet to pass to groupby, but instead we 
can just pass the diet (I included the key ' f' to highlight that unused grouping keys 
are OK): 

In [39]: by_column = people.groupby(mapping, axis=l) 

In [40]: by_column.sum() 

Out[40] : 

blue red 

Joe 0.503905 1.063885 

Steve 1.297183 -1.553778 
Wes -1.021228 -1.116829 
Jim 0.524712 1.770545 

Travis -4.230992 -2.405455 

The same functionality holds for Series, which can be viewed as a fixed-size mapping: 

In [41]: map_series = pd.Series(mapping) 

In [42]: map_series 
0ut[42] : 
a red 

b red 

c blue 

d blue 

e red 

f orange 
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dtype: object 


In [43]: people.groupby(map_serles, axls=l). count( ) 
0ut[43] : 

blue red 


Toe 2 3 

Steve 2 3 

Wes 1 2 

Tim 2 3 

Travis 2 3 


Grouping with Functions 

Using Python functions is a more generic way of defining a group mapping compared 
with a diet or Series. Any function passed as a group key will be called once per index 
value, with the return values being used as the group names. More concretely, con¬ 
sider the example DataFrame from the previous section, which has peoples first 
names as index values. Suppose you wanted to group by the length of the names; 
while you could compute an array of string lengths, it’s simpler to just pass the len 
function: 

In [44]: people.groupby(len). sum() 

0ut[44] : 

a b c d e 

3 0.591569 -0.993608 0.798764 -0.791374 2.119639 

5 0.886429 -2.001637 -0.371843 1.669025 -0.438570 

6 -0.713544 -0.831154 -2.370232 -1.860761 -0.860757 

Mixing functions with arrays, diets, or Series is not a problem as everything gets con¬ 
verted to arrays internally: 

In [45]: key_llst = ['one', 'one', 'one', 'two', 'two'] 

In [46]: people.groupby( [len, key_list] ) ,nin( ) 

0ut[46] : 

a b c d e 

3 one -0.539741 -1.296221 0.274992 -1.021228 -0.577087 

two 0.124121 0.302614 0.523772 0.000940 1.343810 

5 one 0.886429 -2.001637 -0.371843 1.669025 -0.438570 

6 two -0.713544 -0.831154 -2.370232 -1.860761 -0.860757 

Grouping by Index Levels 

A final convenience for hierarchically indexed datasets is the ability to aggregate 
using one of the levels of an axis index. Let’s look at an example: 

In [47]: colunns = pd.Multiindex.from_arrays([ [ 'US' , 'US', 'US', 'JP' , 'JP'] , 
....: [1, 3, 5, 1, 3]], 

....: names=[ 'cty' , 'tenor']) 

In [48]: hler_df = pd.DataFrame(np.random. randn(4, 5), columns=columns) 
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In [49]: hier_df 
0ut[49] : 

cty US JP 

tenor 13513 

0 0.560145 -1.265934 0.119827 -1.063512 0.332883 

1 -2.359419 -0.199543 -1.541996 -0.970736 -1.307030 

2 0.286350 0.377984 -0.753887 0.331286 1.349742 

3 0.069877 0.246674 -0.011862 1.004812 1.327195 

To group by level, pass the level number or name using the level keyword: 

In [50]: hter_df,groupby(tevel='cty 1 , axis=l).count( ) 

Out[50] : 
cty JP US 

0 2 3 

1 2 3 

2 2 3 

3 2 3 

10.2 Data Aggregation 

Aggregations refer to any data transformation that produces scalar values from 
arrays. The preceding examples have used several of them, including mean, count, 
min, and sum. You may wonder what is going on when you invoke meanQ on a 
GroupBy object. Many common aggregations, such as those found in Table 10-1, 
have optimized implementations. However, you are not limited to only this set of 
methods. 


Table 10-1. Optimized groupby methods 


1 Function name 

Description 1 

count 

Number of non-NA values in the group 

sum 

Sum of non-NA values 

mean 

Mean of non-NA values 

median 

Arithmetic median of non-NA values 

std, var 

Unbiased (n - 1 denominator) standard deviation and variance 

min, max 

Minimum and maximum of non-NA values 

prod 

Product of non-NA values 

first, last 

First and last non-NA values 


You can use aggregations of your own devising and additionally call any method that 
is also defined on the grouped object. For example, you might recall that quantile 
computes sample quantiles of a Series or a DataFrame’s columns. 

While quantile is not explicitly implemented for GroupBy, it is a Series method and 
thus available for use. Internally, GroupBy efficiently slices up the Series, calls 
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piece.quantile(0.9) for each piece, and then assembles those results together into 
the result object: 


In [51]: df 
Out [51]: 

datal 
0 -0.204708 

1 0.478943 

2 -0.519439 

3 -0.555730 

4 1.965781 


data2 keyl key2 
1.393406 a one 
0.092908 a two 
0.281746 b one 
0.769023 b two 
1.246435 a one 


In [52]: grouped = df.groupby( 'keyl 1 ) 


In [53]: groupedf 'datal' ] .quantlle(0.9) 

0ut[53] : 
keyl 

a 1.668413 
b -0.523068 

Name: datal, dtype: float64 

To use your own aggregation functions, pass any function that aggregates an array to 
the aggregate or agg method: 

In [54]: def peak_to_peak(arr) : 

....: return arr.maxQ - arr.minQ 


In [55]: grouped.agg(peak_to_peak) 

0ut[55] : 

datal data2 

keyl 

a 2.170488 1.300498 

b 0.036292 0.487276 

You may notice that some methods like describe also work, even though they are not 
aggregations, strictly speaking: 

In [56]: grouped,descrlbe() 

0ut[56] : 

datal \ 



count 


mean 

std 

min 

25% 

50% 

75% 

keyl 









a 

3.0 

0. 

746672 

1.109736 

-0.204708 

0.137118 

0.478943 

1.222362 

b 

2.0 

-0. 

537585 

0.025662 

-0.555730 

-0.546657 

-0.537585 

-0.528512 




data2 








max 

count 

mean 

std 

min 

25% 

50% 

keyl 









a 

1.965781 

3.0 

0.910916 

0.712217 

0.092908 

0.669671 

1.246435 

b 

-0.519439 

2.0 

0.525384 

0.344556 

0.281746 

0.403565 

0.525384 


75% max 

keyl 
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a 1.319920 1.393406 

b 0.647203 0.769023 

I will explain in more detail what has happened here in Section 10.3, “Apply: General 
split-apply-combine,” on page 302. 



Custom aggregation functions are generally much slower than the 
optimized functions found in Table 10-1. This is because there is 
some extra overhead (function calls, data rearrangement) in con¬ 
structing the intermediate group data chunks. 


Column-Wise and Multiple Function Application 

Let’s return to the tipping dataset from earlier examples. After loading it with 
read_csv, we add a tipping percentage column tip_pct: 

In [57]: tips = pd.read_csv( 1 examples/tips.csv 1 ) 

# Add tip percentage of total bill 


In 

[58]: tips [ 

' tip_| 

DCt'] = 

tips[ 

'tip'] , 

l tips[ 

'totaI_bill 

In [59]: tips[:6] 
0ut[59] : 

total_bi.ll tip 

smoker 

day 

time 

size 

tip_pct 

0 

16.99 

1.01 

No 

Sun 

Dinner 

2 

0.059447 

1 

10.34 

1.66 

No 

Sun 

Dinner 

3 

0.160542 

2 

21.01 

3.50 

No 

Sun 

Dinner 

3 

0.166587 

3 

23.68 

3.31 

No 

Sun 

Dinner 

2 

0.139780 

4 

24.59 

3.61 

No 

Sun 

Dinner 

4 

0.146808 

5 

25.29 

4.71 

No 

Sun 

Dinner 

4 

0.186240 


As you’ve already seen, aggregating a Series or all of the columns of a DataFrame is a 
matter of using aggregate with the desired function or calling a method like mean or 
std. However, you may want to aggregate using a different function depending on the 
column, or multiple functions at once. Fortunately, this is possible to do, which I’ll 
illustrate through a number of examples. First, I’ll group the tips by day and smoker: 

In [60]: grouped = tips,groupby( [ 1 day' , 'smoker']) 

Note that for descriptive statistics like those in Table 10-1, you can pass the name of 
the function as a string: 

In [61]: grouped_pct = grouped [' tip_pct 1 ] 

In [62]: grouped_pct.agg( 'mean' ) 

0ut[62] : 
day smoker 

Fri No 0.151650 

Yes 0.174783 

Sat No 0.158048 
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Yes 

0.147906 

Sun 

No 

0.160113 


Yes 

0.187250 

Thur 

No 

0.160298 


Yes 

0.163863 

Name: 

tip_pct. 

dtype: float64 


If you pass a list of functions or function names instead, you get back a DataFrame 
with column names taken from the functions: 

In [63]: grouped_pct. agg([ 'mean' , 'std', peak_to_peak] ) 

0ut[63] : 


day 

smoker 

mean 

std 

peak_to_peak 

Fri 

No 

0.151650 

0.028123 

0.067349 


Yes 

0.174783 

0.051293 

0.159925 

Sat 

No 

0.158048 

0.039767 

0.235193 


Yes 

0.147906 

0.061375 

0.290095 

Sun 

No 

0.160113 

0.042347 

0.193226 


Yes 

0.187250 

0.154134 

0.644685 

Thur 

No 

0.160298 

0.038774 

0.193350 


Yes 

0.163863 

0.039389 

0.151240 


Here we passed a list of aggregation functions to agg to evaluate indepedently on the 
data groups. 

You don’t need to accept the names that GroupBy gives to the columns; notably, 
lambda functions have the name '<lambda>', which makes them hard to identify 

(you can see for yourself by looking at a function’s_name_attribute). Thus, if you 

pass a list of (name, function) tuples, the first element of each tuple will be used as 
the DataFrame column names (you can think of a list of 2-tuples as an ordered 
mapping): 

In [64]: grouped_pct. agg( [( 'foo' , 'mean'), ('bar', np.std)]) 

0ut[64] : 


day 

smoker 

foo 

bar 

Fri 

No 

0.151650 

0.028123 


Yes 

0.174783 

0.051293 

Sat 

No 

0.158048 

0.039767 


Yes 

0.147906 

0.061375 

Sun 

No 

0.160113 

0.042347 


Yes 

0.187250 

0.154134 

Thur 

No 

0.160298 

0.038774 


Yes 

0.163863 

0.039389 


With a DataFrame you have more options, as you can specify a list of functions to 
apply to all of the columns or different functions per column. To start, suppose we 
wanted to compute the same three statistics for the tip_pct and total_bi.ll 
columns: 
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In [65]: functions = ['count', 'mean', 'max'] 

In [66]: result = grouped[' tlp_pct' , 'total_blll' ]. agg(functions) 


In [67]: result 
0ut[67] : 




tip_pct 



total_bill 





count 

mean 

max 

count 

mean 

max 

day 

smoker 







Fri 

No 

4 

0.151650 

0.187735 

4 

18.420000 

22.75 


Yes 

15 

0.174783 

0.263480 

15 

16.813333 

40.17 

Sat 

No 

45 

0.158048 

0.291990 

45 

19.661778 

48.33 


Yes 

42 

0.147906 

0.325733 

42 

21.276667 

50.81 

Sun 

No 

57 

0.160113 

0.252672 

57 

20.506667 

48.17 


Yes 

19 

0.187250 

0.710345 

19 

24.120000 

45.35 

Thur 

No 

45 

0.160298 

0.266312 

45 

17.113111 

41.19 


Yes 

17 

0.163863 

0.241255 

17 

19.190588 

43.11 


As you can see, the resulting DataFrame has hierarchical columns, the same as you 
would get aggregating each column separately and using concat to glue the results 
together using the column names as the keys argument: 


In [68]: result[ 'tlp_pct' ] 
0ut[68] : 


day 

smoker 

count 

mean 

max 

Fri 

No 

4 

0.151650 

0.187735 


Yes 

15 

0.174783 

0.263480 

Sat 

No 

45 

0.158048 

0.291990 


Yes 

42 

0.147906 

0.325733 

Sun 

No 

57 

0.160113 

0.252672 


Yes 

19 

0.187250 

0.710345 

Thur 

No 

45 

0.160298 

0.266312 


Yes 

17 

0.163863 

0.241255 


As before, a list of tuples with custom names can be passed: 

In [69]: ftuples = [( 'Durchschnltt' , 'mean'), (' Abwelchung' , np.var)] 


In [70]: grouped [' tip_pct' , 'total_bill' ]. agg(ftuples) 
Out[70] : 


day 

smoker 

tip_pct 

Durchschnltt Abwelchung 

total_blll 

Durchschnltt 

Abweichung 

Fri 

No 

0.151650 

0.000791 

18.420000 

25.596333 


Yes 

0.174783 

0.002631 

16.813333 

82.562438 

Sat 

No 

0.158048 

0.001581 

19.661778 

79.908965 


Yes 

0.147906 

0.003767 

21.276667 

101.387535 

Sun 

No 

0.160113 

0.001793 

20.506667 

66.099980 


Yes 

0.187250 

0.023757 

24.120000 

109.046044 

Thur 

No 

0.160298 

0.001503 

17.113111 

59.625081 


Yes 

0.163863 

0.001551 

19.190588 

69.808518 
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Now, suppose you wanted to apply potentially different functions to one or more of 
the columns. To do this, pass a diet to agg that contains a mapping of column names 
to any of the function specifications listed so far: 


In [71]: grouped. agg({ 1 tip' : np.max, ’size' : 'sun'}) 
0ut[71] : 




tip size 




day 

snoker 






Fri 

No 

3.50 

9 





Yes 

4.73 

31 




Sat 

No 

9.00 

115 





Yes 

10.00 

104 




Sun 

No 

6.00 

167 





Yes 

6.50 

49 




Thur 

No 

6.70 

112 





Yes 

5.00 

40 




In [72]: grouped. agg({ 'tip_pct' 

: ['min'. 

'max', 'mean'. 





sun' }) 



0ut[72] : 







tip_pct 




size 



min 

max 

mean 

std 

sum 

day 

snoker 






Fri 

No 

0.120385 

0.187735 

0.151650 

0.028123 

9 


Yes 

0.103555 

0.263480 

0.174783 

0.051293 

31 

Sat 

No 

0.056797 

0.291990 

0.158048 

0.039767 

115 


Yes 

0.035638 

0.325733 

0.147906 

0.061375 

104 

Sun 

No 

0.059447 

0.252672 

0.160113 

0.042347 

167 


Yes 

0.065660 

0.710345 

0.187250 

0.154134 

49 

Thur 

No 

0.072961 

0.266312 

0.160298 

0.038774 

112 


Yes 

0.090014 

0.241255 

0.163863 

0.039389 

40 


A DataFrame will have hierarchical columns only if multiple functions are applied to 
at least one column. 

Returning Aggregated Data Without Row Indexes 

In all of the examples up until now, the aggregated data comes back with an index, 
potentially hierarchical, composed from the unique group key combinations. Since 
this isn’t always desirable, you can disable this behavior in most cases by passing 
as_index=False to groupby: 

In [73]: tips.groupby( [ 1 day' , 'snoker'], as_index=False).nean( ) 


0ut[73] : 
day 

snoker 

total_bill 

tip 

size 

tip_pct 

0 

Fri 

No 

18.420000 

2.812500 

2.250000 

0.151650 

1 

Fri 

Yes 

16.813333 

2.714000 

2.066667 

0.174783 

2 

Sat 

No 

19.661778 

3.102889 

2.555556 

0.158048 

3 

Sat 

Yes 

21.276667 

2.875476 

2.476190 

0.147906 

4 

Sun 

No 

20.506667 

3.167895 

2.929825 

0.160113 

5 

Sun 

Yes 

24.120000 

3.516842 

2.578947 

0.187250 
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6 Thur No 17.113111 2.673778 2.488889 0.160298 

7 Thur Yes 19.190588 3.030000 2.352941 0.163863 

Of course, it’s always possible to obtain the result in this format by calling 
reset_index on the result. Using the as_index=False method avoids some unneces¬ 
sary computations. 

10.3 Apply: General split-apply-combine 

The most general-purpose GroupBy method is apply, which is the subject of the rest 
of this section. As illustrated in Figure 10-2, apply splits the object being manipulated 
into pieces, invokes the passed function on each piece, and then attempts to concate¬ 
nate the pieces together. 



Returning to the tipping dataset from before, suppose you wanted to select the top 
five tip_pct values by group. First, write a function that selects the rows with the 
largest values in a particular column: 

In [74]: def top(df, n=5, column=' tip_pct' ): 

....: return df.sort_values(by=column) [ -n :] 


In [75]: top(tips, n=6) 
0ut[75] : 



total_bill 

tip 

smoker 

day 

time 

size 

tip_pct 

109 

14.31 

4.00 

Yes 

Sat 

Dinner 

2 

0.279525 

183 

23.17 

6.50 

Yes 

Sun 

Dinner 

4 

0.280535 

232 

11.61 

3.39 

No 

Sat 

Dinner 

2 

0.291990 
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67 

3.07 

1.00 

Yes 

Sat 

Dinner 

1 

0.325733 

178 

9.60 

4.00 

Yes 

Sun 

Dinner 

2 

0.416667 

172 

7.25 

5.15 

Yes 

Sun 

Dinner 

2 

0.710345 


Now, if we group by smoker, say, and call apply with this function, we get the 
following: 

In [76]: tips,groupby( 'smoker ') .apply(top) 


0ut[76] : 

smoker 

total_bill 

tip 

smoker 

day 

time 

size 

tip_pct 

No 

88 

24.71 

5.85 

No 

Thur 

Lunch 

2 

0.236746 


185 

20.69 

5.00 

No 

Sun 

Dinner 

5 

0.241663 


51 

10.29 

2.60 

No 

Sun 

Dinner 

2 

0.252672 


149 

7.51 

2.00 

No 

Thur 

Lunch 

2 

0.266312 


232 

11.61 

3.39 

No 

Sat 

Dinner 

2 

0.291990 

Yes 

109 

14.31 

4.00 

Yes 

Sat 

Dinner 

2 

0.279525 


183 

23.17 

6.50 

Yes 

Sun 

Dinner 

4 

0.280535 


67 

3.07 

1.00 

Yes 

Sat 

Dinner 

1 

0.325733 


178 

9.60 

4.00 

Yes 

Sun 

Dinner 

2 

0.416667 


172 

7.25 

5.15 

Yes 

Sun 

Dinner 

2 

0.710345 


What has happened here? The top function is called on each row group from the 
DataFrame, and then the results are glued together using pandas. concat, labeling the 
pieces with the group names. The result therefore has a hierarchical index whose 
inner level contains index values from the original DataFrame. 

If you pass a function to apply that takes other arguments or keywords, you can pass 
these after the function: 

In [77]: tips,groupby( [' smoker 1 , 'day 1 ]). apply(top, n=l, column= 1 total_bi.ll' ) 
Out[77] : 





total_bill 

tip 

smoker 

day 

time 

size 

tip_pct 

smoker 

day 









No 

Fri 

94 

22.75 

3.25 

No 

Fri 

Dinner 

2 

0.142857 


Sat 

212 

48.33 

9.00 

No 

Sat 

Dinner 

4 

0.186220 


Sun 

156 

48.17 

5.00 

No 

Sun 

Dinner 

6 

0.103799 


Thur 

142 

41.19 

5.00 

No 

Thur 

Lunch 

5 

0.121389 

Yes 

Fri 

95 

40.17 

4.73 

Yes 

Fri 

Dinner 

4 

0.117750 


Sat 

170 

50.81 

10.00 

Yes 

Sat 

Dinner 

3 

0.196812 


Sun 

182 

45.35 

3.50 

Yes 

Sun 

Dinner 

3 

0.077178 


Thur 

197 

43.11 

5.00 

Yes 

Thur 

Lunch 

4 

0.115982 



Beyond these basic usage mechanics, getting the most out of apply 
may require some creativity. What occurs inside the function 
passed is up to you; it only needs to return a pandas object or a 
scalar value. The rest of this chapter will mainly consist of examples 
showing you how to solve various problems using groupby. 
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You may recall that I earlier called describe on a GroupBy object: 

In [78]: result = tips,groupby( 'smoker 1 )[' tip_pct 1 ] ,describe() 


In [79]: result 
0ut[79] : 


smoker 

count 

mean 

std 

min 

25% 

50% 

75% \ 

No 

151.0 

0.159328 

0.039910 

0.056797 

0.136906 

0.155625 

0.185014 

Yes 

93.0 

0.163196 

0.085119 

0.035638 

0.106771 

0.153846 

0.195059 


max 

smoker 

No 0.291990 

Yes 0.710345 

In [80]: result.unstack( 'smoker' ) 


Out[80] : 



smoker 


count 

No 

151.000000 


Yes 

93.000000 

mean 

No 

0.159328 


Yes 

0.163196 

std 

No 

0.039910 


Yes 

0.085119 

min 

No 

0.056797 


Yes 

0.035638 

25% 

No 

0.136906 


Yes 

0.106771 

50% 

No 

0.155625 


Yes 

0.153846 

75% 

No 

0.185014 


Yes 

0.195059 

max 

No 

0.291990 


Yes 

0.710345 

dtype : 

float64 



Inside GroupBy, when you invoke a method like describe, it is actually just a short¬ 
cut for: 

f = lambda x: x.describeQ 
grouped.apply(f ) 

Suppressing the Group Keys 

In the preceding examples, you see that the resulting object has a hierarchical index 
formed from the group keys along with the indexes of each piece of the original 
object. You can disable this bypassing group_keys=False to groupby: 
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In [81]: tips,groupby( 'smoker' , group_keys=False).apply(top) 


Out [ 81 ]: 



total_bilI 

tip 

smoker 

day 

time 

size 

tip_pct 

88 

24.71 

5.85 

No 

Thur 

Lunch 

2 

0.236746 

185 

20.69 

5.00 

No 

Sun 

Dinner 

5 

0.241663 

51 

10.29 

2.60 

No 

Sun 

Dinner 

2 

0.252672 

149 

7.51 

2.00 

No 

Thur 

Lunch 

2 

0.266312 

232 

11.61 

3.39 

No 

Sat 

Dinner 

2 

0.291990 

109 

14.31 

4.00 

Yes 

Sat 

Dinner 

2 

0.279525 

183 

23.17 

6.50 

Yes 

Sun 

Dinner 

4 

0.280535 

67 

3.07 

1.00 

Yes 

Sat 

Dinner 

1 

0.325733 

178 

9.60 

4.00 

Yes 

Sun 

Dinner 

2 

0.416667 

172 

7.25 

5.15 

Yes 

Sun 

Dinner 

2 

0.710345 


Quantile and Bucket Analysis 

As you may recall from Chapter 8, pandas has some tools, in particular cut and qcut, 
for slicing data up into buckets with bins of your choosing or by sample quantiles. 
Combining these functions with groupby makes it convenient to perform bucket or 
quantile analysis on a dataset. Consider a simple random dataset and an equal-length 
bucket categorization using cut: 

In [82]: frame = pd.DataFrame({' datal' : np.random. randn(1000) , 

....: 'data2': np.random. randn(1000)}) 

In [83]: quartlles = pd.cut(frame.datal, 4) 

In [84]: quartlles[ : 10] 

0ut[84] : 

0 (-1.23, 0.489] 

1 (-2.956, -1.23] 

2 (-1.23, 0.489] 

3 (0.489, 2.208] 

4 (-1.23, 0.489] 

5 (0.489, 2.208] 

6 (-1.23, 0.489] 

7 (-1.23, 0.489] 

8 (0.489, 2.208] 

9 (0.489, 2.208] 

Name: datal, dtype: category 

Categories (4, interval[float64] ): [(-2.956, -1.23] < (-1.23, 0.489] < (0.489, 2. 
208] < (2.208, 3.928]] 

The Categorical object returned by cut can be passed directly to groupby. So we 
could compute a set of statistics for the data2 column like so: 

In [85]: def get_stats(group) : 

....: return {'min': group. min(), 'max': group. max(), 

....: 'count': group.count( ), 'mean': group. mean()} 

In [86]: grouped = frame.data2.groupby(quartiles) 
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In [87]: grouped.apply(get_stats).unstack( ) 
0ut[87] : 



count 

max 

mean 

datal 





min 


(-2.956, -1.23] 
(-1.23, 0.489] 
(0.489, 2.208] 
(2.208, 3.928] 


95.0 1.670835 -0.039521 -3.399312 

598.0 3.260383 -0.002051 -2.989741 

297.0 2.954439 0.081822 -3.745356 

10.0 1.765640 0.024750 -1.929776 


These were equal-length buckets; to compute equal-size buckets based on sample 
quantiles, use qcut. I’ll pass labels=False to just get quantile numbers: 


# Return quantile numbers 

In [88]: grouping = pd.qcut(frame.datal, 10, labels=False) 


In [89]: grouped = frame.data2.groupby(grouping) 


In [90]: grouped.appty(get_stats).unstack( ) 
Out[90] : 



count 

max 

mean 

min 

datal 

0 

100.0 

1.670835 

-0.049902 

-3.399312 

1 

100.0 

2.628441 

0.030989 

-1.950098 

2 

100.0 

2.527939 

-0.067179 

-2.925113 

3 

100.0 

3.260383 

0.065713 

-2.315555 

4 

100.0 

2.074345 

-0.111653 

-2.047939 

5 

100.0 

2.184810 

0.052130 

-2.989741 

6 

100.0 

2.458842 

-0.021489 

-2.223506 

7 

100.0 

2.954439 

-0.026459 

-3.056990 

8 

100.0 

2.735527 

0.103406 

-3.745356 

9 

100.0 

2.377020 

0.220122 

-2.064111 


We will take a closer look at pandas’s Categorical type in Chapter 12. 

Example: Filling Missing Values with Group-Specific Values 

When cleaning up missing data, in some cases you will replace data observations 
using dropna, but in others you may want to impute (fill in) the null (NA) values 
using a fixed value or some value derived from the data, fillna is the right tool to 
use; for example, here I fill in NA values with the mean: 

In [91]: s = pd.Series(np.random.randn(6)) 


In [92]: s[::2] = np.nan 


In [93]: s 
0ut[93] : 

0 NaN 

1 -0.125921 

2 NaN 

3 -0.884475 
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4 NaN 

5 0.227290 
dtype: float64 

In [94]: s.fillna(s.mean()) 

0ut[94] : 

0 -0.261035 

1 -0.125921 

2 -0.261035 

3 -0.884475 

4 -0.261035 

5 0.227290 
dtype: float64 

Suppose you need the fill value to vary by group. One way to do this is to group the 
data and use apply with a function that calls fillna on each data chunk. Here is 
some sample data on US states divided into eastern and western regions: 

In [95]: states = ['Ohio', 'New York', 'Vermont', 'Florida', 

....: 'Oregon', 'Nevada', 'California', 'Idaho'] 

In [96]: group_key = ['East'] * 4 + ['West'] * 4 

In [97]: data = pd.Series(np.random. randn(8) , index=states) 

In [98]: data 
0ut[98] : 

Ohio 0.922264 

New York -2.153545 

Vermont -0.365757 

Florida -0.375842 

Oregon 0.329939 

Nevada 0.981994 

California 1.105913 

Idaho -1.613716 

dtype: float64 

Note that the syntax [' East' ] * 4 produces a list containing four copies of the ele¬ 
ments in [' East 1 ]. Adding lists together concatenates them. 

Let’s set some values in the data to be missing: 

In [99]: data[[ 'Vermont' , 'Nevada', 'Idaho']] = np.nan 


In [100]: 
Out[100] : 
Ohio 

New York 

Vermont 

Florida 

Oregon 

Nevada 

California 


data 


0.922264 

-2.153545 

NaN 

-0.375842 

0.329939 

NaN 

1.105913 
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Idaho NaN 

dtype: float64 

In [101]: data.groupby(group_key),mean( ) 

Out[101]: 

East -0.535707 
West 0.717926 
dtype: float64 

We can fill the NA values using the group means like so: 

In [102]: fill_mean = lambda g: g.flllna(g.mean( )) 


In [103]: data.groupby(group_key).apply(ftll_mean) 
Out[103] : 

Ohio 0.922264 

New York -2.153545 


Vermont -0.535707 
Florida -0.375842 
Oregon 0.329939 
Nevada 0.717926 
California 1.105913 
Idaho 0.717926 
dtype: float64 


In another case, you might have predefined fill values in your code that vary by 
group. Since the groups have a name attribute set internally, we can use that: 


In [104]: fill_values = {'East': 0.5, ’West': -1} 


In [105]: fill_func = lambda g: g.fillna(fill_values[g.name] ) 


In [106]: data.groupby(group_key).apply(flll_func) 
Out[106] : 


Ohio 

New York 

Vermont 

Florida 

Oregon 

Nevada 

California 

Idaho 


0.922264 

-2.153545 

0.500000 

-0.375842 

0.329939 

- 1.000000 

1.105913 

- 1.000000 


dtype: float64 


Example: Random Sampling and Permutation 

Suppose you wanted to draw a random sample (with or without replacement) from a 
large dataset for Monte Carlo simulation purposes or some other application. There 
are a number of ways to perform the “draws”; here we use the sample method for 
Series. 

To demonstrate, here’s a way to construct a deck of English-style playing cards: 


308 | Chapter 10: Data Aggregation and Group Operations 



# Hearts, Spades, Clubs, Diamonds 

suits = [ 'H 1 , 'S', 'C' , ' D ' ] 

card_val = (list(range(l, 11)) + [10] * 3) * 4 

base_names = ['A'] + list(range(2, 11)) + ['J', ' K' , ' Q' ] 

cards = [] 

for suit in ['H', 'S', 'C', 'D' ]: 

cards.extend(str(nun) + suit for nun in base_names) 

deck = pd.Series(card_val, index=cards) 

So now we have a Series of length 52 whose index contains card names and values are 
the ones used in Blackjack and other games (to keep things simple, I just let the ace 
'A' be 1): 


In [108]: deck[ : 13] 
Out[108] : 

AH 

1 

2H 

2 

3H 

3 

4H 

4 

5H 

5 

6H 

6 

7H 

7 

8H 

8 

9H 

9 

10H 

10 

3H 

10 

KH 

10 

QH 

10 

dtype : 

int64 


Now, based on what I said before, drawing a hand of five cards from the deck could 
be written as: 

In [109]: def draw(deck, n=5): 

.: return deck.sample(n) 

In [110]: draw(deck) 

Out[110] : 

AD 1 

8C 8 

5H 5 

KC 10 

2C 2 

dtype: int64 

Suppose you wanted two random cards from each suit. Because the suit is the last 
character of each card name, we can group based on this and use apply: 

In [111]: get_suit = lambda card: card[-l] # last letter is suit 

In [112]: deck.groupby(get_suit),apply(draw, n=2) 

0ut[112] : 
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c 

2C 

2 


3C 

3 

D 

KD 

10 


8D 

8 

H 

KH 

10 


3H 

3 

S 

2S 

2 


4S 

4 


dtype: int64 

Alternatively, we could write: 

In [113]: deck.groupby(get_suit, group_keys=False).apply(draw, n=2) 

0ut[113] : 

KC 10 

3C 10 

AD 1 

5D 5 

5H 5 

6H 6 

7S 7 

KS 10 

dtype: tnt64 

Example: Group Weighted Average and Correlation 

Under the split-apply-combine paradigm of groupby, operations between columns in 
a DataFrame or two Series, such as a group weighted average, are possible. As an 
example, take this dataset containing group keys, values, and some weights: 

In [114]: df = pd.DataFrame({ 'category' : ['a', 'a', 'a', 'a 1 , 

. : 1 b' , 1 b' , 1 b' , 'b 1 ], 

.: 'data': np.random.randn(8), 

.: 'weights': np.random.rand(8)}) 


In [115]: df 
0ut[115]: 



category 

data 

weights 

0 

a 

1.561587 

0.957515 

1 

a 

1.219984 

0.347267 

2 

a 

-0.482239 

0.581362 

3 

a 

0.315667 

0.217091 

4 

b 

-0.047852 

0.894406 

5 

b 

-0.454145 

0.918564 

6 

b 

-0.556774 

0.277825 

7 

b 

0.253321 

0.955905 


The group weighted average by category would then be: 

In [116]: grouped = df.groupby( 'category' ) 

In [117]: get_wavg = lambda g: np. average(g[ 1 data 1 ], weights=g[ 'weights' ]) 
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In [118]: grouped.apply(get_wavg) 

0ut[118]: 
category 
a 0.811643 
b -0.122262 
dtype: float64 

As another example, consider a financial dataset originally obtained from Yahoo! 
Finance containing end-of-day prices for a few stocks and the S&P 500 index (the SPX 
symbol): 

In [119]: close_px = pd.read_csv(' examples/stock_px_2.csv' , parse_dates=True, 
.: index_col=0) 


In [120]: close_px.info() 

<class ' pandas .core.frame.DataFrame 1 > 

Datetimelndex: 2214 entries, 2003-01-02 to 2011-10-14 
Data columns (total 4 columns): 

AAPL 2214 non-null float64 

MSFT 2214 non-null float64 

XOM 2214 non-null float64 

SPX 2214 non-null float64 

dtypes: float64(4) 
memory usage: 86.5 KB 


In [121]: close_px[-4: ] 
0ut[121] : 



AAPL 

MSFT 

XOM 

SPX 

2011-10-11 

400.29 

27.00 

76.27 

1195.54 

2011-10-12 

402.19 

26.96 

77.16 

1207.25 

2011-10-13 

408.43 

27.18 

76.37 

1203.66 

2011-10-14 

422.00 

27.27 

78.11 

1224.58 


One task of interest might be to compute a DataFrame consisting of the yearly corre¬ 
lations of daily returns (computed from percent changes) with SPX. As one way to do 
this, we first create a function that computes the pairwise correlation of each column 
with the 'SPX' column: 

In [122]: spx_corr = lambda x: x.corrwith(x[ 'SPX' ]) 

Next, we compute percent change on close_px using pct_change: 

In [123]: rets = close_px.pct_change().dropna( ) 

Lastly, we group these percent changes by year, which can be extracted from each row 
label with a one-line function that returns the year attribute of each datetime label: 

In [124]: get_year = lambda x: x.year 

In [125]: by_year = rets.groupby(get_year) 

In [126]: by_year.apply(spx_corr) 

0ut[126]: 
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AAPL 

MSFT 

XOM 

SPX 

2003 

0.541124 

0.745174 

0.661265 

1.0 

2004 

0.374283 

0.588531 

0.557742 

1.0 

2005 

0.467540 

0.562374 

0.631010 

1.0 

2006 

0.428267 

0.406126 

0.518514 

1.0 

2007 

0.508118 

0.658770 

0.786264 

1.0 

2008 

0.681434 

0.804626 

0.828303 

1.0 

2009 

0.707103 

0.654902 

0.797921 

1.0 

2010 

0.710105 

0.730118 

0.839057 

1.0 

2011 

0.691931 

0.800996 

0.859975 

1.0 


You could also compute inter-column correlations. Here we compute the annual cor¬ 
relation between Apple and Microsoft: 

In [127]: by_year. apply(lambda g: g[ 'AAPL' ] .corr(g[ 'MSFT 1 ])) 

0ut[127]: 

2003 0.480868 

2004 0.259024 

2005 0.300093 

2006 0.161735 

2007 0.417738 

2008 0.611901 

2009 0.432738 

2010 0.571946 

2011 0.581987 
dtype: float64 


Example: Group-Wise Linear Regression 

In the same theme as the previous example, you can use groupby to perform more 
complex group-wise statistical analysis, as long as the function returns a pandas 
object or scalar value. For example, I can define the following regress function 
(using the statsmodels econometrics library), which executes an ordinary least 
squares (OLS) regression on each chunk of data: 

import statsmodels.api as sm 

def regress(data, yvar, xvars): 

Y = data[yvar] 

X = datafxvars] 

X[' intercept 1 ] = 1. 
result = sm.OLS(Y, X).fit() 
return result.params 

Now, to run a yearly linear regression of AAPL on SPX returns, execute: 

In [129]: by_year.apply( regress, 'AAPL', ['SPX']) 

0ut[129]: 


SPX 

2003 1.195406 

2004 1.363463 

2005 1.766415 

2006 1.645496 


intercept 

0.000710 

0.004201 

0.003246 

0.000080 


312 | Chapter 10: Data Aggregation and Group Operations 



2007 

1.198761 

0.003438 

2008 

0.968016 

-0.001110 

2009 

0.879103 

0.002954 

2010 

1.052608 

0.001261 

2011 

0.806605 

0.001514 


10.4 Pivot Tables and Cross-Tabulation 

A pivot table is a data summarization tool frequently found in spreadsheet programs 
and other data analysis software. It aggregates a table of data by one or more keys, 
arranging the data in a rectangle with some of the group keys along the rows and 
some along the columns. Pivot tables in Python with pandas are made possible 
through the groupby facility described in this chapter combined with reshape opera¬ 
tions utilizing hierarchical indexing. DataFrame has a pivot_table method, and 
there is also a top-level pandas.pivot_table function. In addition to providing a 
convenience interface to groupby, pivot_table can add partial totals, also known as 
margins. 

Returning to the tipping dataset, suppose you wanted to compute a table of group 
means (the default pivot_table aggregation type) arranged by day and smoker on 
the rows: 


In [130]: tips.pivot_table(index: [' day' , 'smoker']) 
0ut[130] : 


day 

smoker 

size 

tip 

tip_pct 

total_bill 

Fri 

No 

2.250000 

2.812500 

0.151650 

18.420000 


Yes 

2.066667 

2.714000 

0.174783 

16.813333 

Sat 

No 

2.555556 

3.102889 

0.158048 

19.661778 


Yes 

2.476190 

2.875476 

0.147906 

21.276667 

Sun 

No 

2.929825 

3.167895 

0.160113 

20.506667 


Yes 

2.578947 

3.516842 

0.187250 

24.120000 

Thur 

No 

2.488889 

2.673778 

0.160298 

17.113111 


Yes 

2.352941 

3.030000 

0.163863 

19.190588 


This could have been produced with groupby directly. Now, suppose we want to 
aggregate only tip_pct and size, and additionally group by time. I’ll put smoker in 
the table columns and day in the rows: 

In [131]: tips.pivot_tab!e( [ 'tip_pct' , 'size'], index= [ 'time' , 'day'], 

.: columns='smoker') 

0ut[131]: 


smoker 


size 

No 

Yes 

tip_pct 

No 

Yes 

time 

Dinner 

day 

Fri 

2.000000 

2.222222 

0.139622 

0.165347 


Sat 

2.555556 

2.476190 

0.158048 

0.147906 


Sun 

2.929825 

2.578947 

0.160113 

0.187250 


Thur 

2.000000 

NaN 

0.159744 

NaN 
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Lunch Fri 3.000000 1.833333 0.187735 0.188937 

Thur 2.500000 2.352941 0.160311 0.163863 

We could augment this table to include partial totals by passing narglns=True. This 
has the effect of adding All row and column labels, with corresponding values being 
the group statistics for all the data within a single tier: 


In [132]: tips.pivot_table( [' tip_pct' , 'size'], lndex=[ ! time' , 'day'], 

.: columns='smoker', marglns=True) 

0ut[132] : 


smoker 


size 

No 

Yes 

All 

tlp_pct 

No 

Yes 

All 

time 

Dinner 

day 

Frl 

2.000000 

2.222222 

2.166667 

0.139622 

0.165347 

0.158916 


Sat 

2.555556 

2.476190 

2.517241 

0.158048 

0.147906 

0.153152 


Sun 

2.929825 

2.578947 

2.842105 

0.160113 

0.187250 

0.166897 


Thur 

2.000000 

NaN 

2.000000 

0.159744 

NaN 

0.159744 

Lunch 

Frl 

3.000000 

1.833333 

2.000000 

0.187735 

0.188937 

0.188765 


Thur 

2.500000 

2.352941 

2.459016 

0.160311 

0.163863 

0.161301 

All 


2.668874 

2.408602 

2.569672 

0.159328 

0.163196 

0.160803 


Here, the All values are means without taking into account smoker versus non- 
smoker (the All columns) or any of the two levels of grouping on the rows (the All 
row). 

To use a different aggregation function, pass it to aggfunc. For example, 'count' or 
len will give you a cross-tabulation (count or frequency) of group sizes: 

In [133]: tips.plvot_table( 'tlp_pct' , lndex=[' time' , 'smoker'], columns= 'day' , 

.: aggfunc=len, marglns=True) 

0ut[133]: 


day 

time 

smoker 

Frl 

Sat 

Sun 

Thur 

All 

Dinner 

No 

3.0 

45.0 

57.0 

1.0 

106.0 


Yes 

9.0 

42.0 

19.0 

NaN 

70.0 

Lunch 

No 

1.0 

NaN 

NaN 

44.0 

45.0 


Yes 

6.0 

NaN 

NaN 

17.0 

23.0 

All 


19.0 

87.0 

76.0 

62.0 

244.0 


If some combinations are empty (or otherwise NA), you may wish to pass a 
fill value: 


In [134]: 

tips.plvot_table(' 

tlp_pct', lndex=[’ time' , 'size 

0ut[134] : 


columns= 'day 

, aggfunc= 

='mean' , 

day 

time size smoker 

Frl 

Sat 

Sun 

Thur 

Dinner 1 

No 

0.000000 

0.137931 

0.000000 

0.000000 


Yes 

0.000000 

0.325733 

0.000000 

0.000000 

2 

No 

0.139622 

0.162705 

0.168859 

0.159744 


Yes 

0.171297 

0.148668 

0.207893 

0.000000 

3 

No 

0.000000 

0.154661 

0.152663 

0.000000 


'smoker'], 
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Yes 

0.000000 

0.144995 

0.152660 

0.000000 

4 

No 

0.000000 

0.150096 

0.148143 

0.000000 


Yes 

0.117750 

0.124515 

0.193370 

0.000000 

5 

No 

0.000000 

0.000000 

0.206928 

0.000000 


Yes 

0.000000 

0.106572 

0.065660 

0.000000 

Lunch 1 

No 

0.000000 

0.000000 

0.000000 

0.181728 


Yes 

0.223776 

0.000000 

0.000000 

0.000000 

2 

No 

0.000000 

0.000000 

0.000000 

0.166005 


Yes 

0.181969 

0.000000 

0.000000 

0.158843 

3 

No 

0.187735 

0.000000 

0.000000 

0.084246 


Yes 

0.000000 

0.000000 

0.000000 

0.204952 

4 

No 

0.000000 

0.000000 

0.000000 

0.138919 


Yes 

0.000000 

0.000000 

0.000000 

0.155410 

5 

No 

0.000000 

0.000000 

0.000000 

0.121389 

6 

No 

r A . in 

0.000000 

0.000000 

0.000000 

0.173706 


[21 rows x 4 columns] 


See Table 10-2 for a summary of pivot_table methods. 


Table 10-2. pivotJtable options 


Function name Description 


values 

index 

columns 

aggfunc 

flll_value 

dropna 

margins 


Column name or names to aggregate; by default aggregates all numeric columns 
Column names or other group keys to group on the rows of the resulting pivot table 
Column names or other group keys to group on the columns of the resulting pivot table 
Aggregation function or list of functions ( 1 mean' by default); can be any function valid in a groupby 
context 

Replace missing values in result table 

If True, do not include columns whose entries are all NA 

Add row/column subtotals and grand total (False by default) 


Cross-Tabulations: Crosstab 

A cross-tabulation (or crosstab for short) is a special case of a pivot table that com¬ 
putes group frequencies. Here is an example: 


In [138]: data 
Out[138]: 



Sample Nationality 

Handedness 

0 

1 

USA 

Right-handed 

1 

2 

Japan 

Left-handed 

2 

3 

USA 

Right-handed 

3 

4 

Japan 

Right-handed 

4 

5 

Japan 

Left-handed 

5 

6 

Japan 

Right-handed 

6 

7 

USA 

Right-handed 

7 

8 

USA 

Left-handed 

8 

9 

Japan 

Right-handed 

9 

10 

USA 

Right-handed 
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As part of some survey analysis, we might want to summarize this data by nationality 
and handedness. You could use pivot_table to do this, but the pandas.crosstab 
function can be more convenient: 


In [139]: pd.crosstab(data.Nationality , data.Handedness, margins=True) 
0ut[139] : 

Handedness Left-handed Right-handed All 
Nationality 

Japan 2 35 

USA 1 45 

All 3 7 10 


The first two arguments to crosstab can each either be an array or Series or a list of 
arrays. As in the tips data: 

In [140]: pd.crosstab([tips.time, tips. day], tips.smoker, margins=True) 

Out [140] : 


smoker 

time 

day 

No 

Yes 

All 

Dinner 

Fri 

3 

9 

12 


Sat 

45 

42 

87 


Sun 

57 

19 

76 


Thur 

1 

0 

1 

Lunch 

Fri 

1 

6 

7 


Thur 

44 

17 

61 

All 


151 

93 

244 


10.5 Conclusion 

Mastering pandas’s data grouping tools can help both with data cleaning as well as 
modeling or statistical analysis work. In Chapter 14 we will look at several more 
example use cases for groupby on real data. 

In the next chapter, we turn our attention to time series data. 
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CHAPTER 11 


Time Series 


Time series data is an important form of structured data in many different fields, such 
as finance, economics, ecology, neuroscience, and physics. Anything that is observed 
or measured at many points in time forms a time series. Many time series are fixed 
frequency, which is to say that data points occur at regular intervals according to some 
rule, such as every 15 seconds, every 5 minutes, or once per month. Time series can 
also be irregular without a fixed unit of time or offset between units. How you mark 
and refer to time series data depends on the application, and you may have one of the 
following: 

• Timestamps, specific instants in time 

• Fixed periods, such as the month January 2007 or the full year 2010 

• Intervals of time, indicated by a start and end timestamp. Periods can be thought 
of as special cases of intervals 

• Experiment or elapsed time; each timestamp is a measure of time relative to a 
particular start time (e.g., the diameter of a cookie baking each second since 
being placed in the oven) 

In this chapter, I am mainly concerned with time series in the first three categories, 
though many of the techniques can be applied to experimental time series where the 
index may be an integer or floating-point number indicating elapsed time from the 
start of the experiment. The simplest and most widely used kind of time series are 
those indexed by timestamp. 
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pandas also supports indexes based on timedeltas, which can be a 
useful way of representing experiment or elapsed time. We do not 
explore timedelta indexes in this book, but you can learn more in 
the pandas documentation. 


pandas provides many built-in time series tools and data algorithms. You can effi¬ 
ciently work with very large time series and easily slice and dice, aggregate, and 
resample irregular- and fixed-frequency time series. Some of these tools are especially 
useful for financial and economics applications, but you could certainly use them to 
analyze server log data, too. 

11.1 Date and Time Data Types and Tools 

The Python standard library includes data types for date and time data, as well as 
calendar-related functionality. The datetine, tine, and calendar modules are the 
main places to start. The datetine.datetine type, or simply datetine, is widely 
used: 

In [10]: from import datetime 

In [11]: now = datetime. now( ) 

In [12]: now 

0ut[12] : datetime. datetime(2017, 9, 25, 14, 5, 52, 72973) 

In [13]: now. year, now.month, now.day 
0ut[13] : (2017, 9, 25) 

datetine stores both the date and time down to the microsecond, tinedelta repre¬ 
sents the temporal difference between two datetine objects: 

In [14]: delta = datetime(2011, 1, 7) - datetime(2008, 6, 24, 8, 15) 

In [15]: delta 

0ut[15]: datetime. timedelta(926, 56700) 

In [16]: delta.days 
0ut[16] : 926 

In [17]: delta.seconds 
0ut[17] : 56700 

You can add (or subtract) a tinedelta or multiple thereof to a datetine object to 
yield a new shifted object: 

In [18]: from import timedelta 

In [19]: start = datetime(2011, 1, 7) 
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In [20]: start + tlmedelta(12) 

Out[20]: datetime.datetime(2011, 1, 19, 0, 0) 

In [21]: start - 2 * timedelta(12) 

Out[21]: datetime.datetime(2010, 12, 14, 0, 0) 

Table 11-1 summarizes the data types in the datetime module. While this chapter is 
mainly concerned with the data types in pandas and higher-level time series manipu¬ 
lation, you may encounter the datetime-based types in many other places in Python 
in the wild. 

Table 11-1. Types in datetime module 


Type Description 


date Store calendar date (year, month, day) using the Gregorian calendar 

time Store time of day as hours, minutes, seconds, and microseconds 

datetime Stores both date and time 

timedelta Represents the difference between two datetime values (as days, seconds, and microseconds) 
tzinfo Base type for storing time zone information 

Converting Between String and Datetime 

You can format datetime objects and pandas Timestamp objects, which I’ll introduce 
later, as strings using str or the strftime method, passing a format specification: 

In [22]: stamp = datetime(2011, 1, 3) 

In [23]: str(stamp) 

0ut[23] : '2011-01-03 00:00:00’ 

In [24]: stamp.strftimef '%Y-%m-%d' ) 

0ut[24] : '2011-01-03' 

See Table 11-2 for a complete list of the format codes (reproduced from Chapter 2). 
Table 11-2. Datetime format specification (ISO C89 compatible) 


Type Description 


%Y Four-digit year 

%y Two-digit year 

%m Two-digit month [01,12] 

%d Two-digit day [01,31] 

%H Hour (24-hour clock) [00,23] 

%I Hour (12-hour clock) [01,12] 

%M Two-digit minute [00,59] 

%S Second [00,61] (seconds 60,61 account for leap seconds) 
%w Weekday as integer [0 (Sunday), 6] 
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Type Description 


%U Week number of the year [00,53]; Sunday is considered the first day of the week, and days before the first Sunday of 
the year are "week 0" 

%W Week number of the year [00,53]; Monday is considered the first day of the week, and days before the first Monday of 
the year are "week 0" 

%z UTC time zone offset as +HHMM or - HHMM; empty if time zone naive 
%F Shortcut for 7oY-%m-%d (e.g., 2012-4-18) 

%D Shortcut for %m/%d/%y (e.g., 04/18/12) 


You can use these same format codes to convert strings to dates using date 
time, strptime: 

In [25]: value = '2011-01-03' 

In [26]: datetime.strptime(value, '%Y-%m-%d') 

Out[26]: datetime. datetime(2011, 1, 3, 0, 0) 

In [27]: datestrs = ['7/6/2011', '8/6/2011'] 

In [28]: [datetime. strptimefx, '%m/%d/%Y') for x in datestrs] 

Out[28] : 

[datetime. datetime(2011, 7, 6, 0, 0), 
datetime. datetime(2011, 8, 6, 0, 0)] 

datetime, strptime is a good way to parse a date with a known format. However, it 
can be a bit annoying to have to write a format spec each time, especially for common 
date formats. In this case, you can use the parser.parse method in the third-party 
dateutil package (this is installed automatically when you install pandas): 

In [29]: from import parse 

In [30]: parse( 1 2011-01-03 1 ) 

Out[30]: datetime. datetime(2011, 1, 3, 0, 0) 

dateutil is capable of parsing most human-intelligible date representations: 

In [31]: parse( 1 Jan 31, 1997 10:45 PM') 

Out[31]: datetime. datetime(1997, 1, 31, 22, 45) 

In international locales, day appearing before month is very common, so you can pass 
dayfirst=True to indicate this: 

In [32]: parse( 1 6/12/2011 1 , dayfirst=True) 

Out[32]: datetime. datetime(2011, 12, 6, 0, 0) 

pandas is generally oriented toward working with arrays of dates, whether used as an 
axis index or a column in a DataFrame. The to_datetime method parses many dif¬ 
ferent kinds of date representations. Standard date formats like ISO 8601 can be 
parsed very quickly: 
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In [33]: datestrs = ['2011-07-06 12:00:00', '2011-08-06 00:00:00'] 


In [34]: pd.to_datetime(datestrs) 

0ut[34] : Datetimelndex([ ’2011-07-06 12:00:00', '2011-08-06 00:00:00'], dtype='dat 
etime64[ns]' , freq=None) 

It also handles values that should be considered missing (None, empty string, etc.): 

In [35]: idx = pd.to_datetine(datestrs + [None]) 

In [36]: idx 

0ut[36] : Datetlmelndex([ ’2011-07-06 12:00:00' , '2011-08-06 00:00:00', 'NaT' ], dty 
pe= 'datetime64[ns]' , freq=None) 

In [37]: idx[2] 

0ut[37] : NaT 

In [38]: pd.isnull(idx) 

0ut[38]: array( [False, False, True], dtype=bool) 

NaT (Not a Time) is pandas’s null value for timestamp data. 



dateutil. parser is a useful but imperfect tool. Notably, it will rec¬ 
ognize some strings as dates that you might prefer that it didn’t— 
for example, ' 42' will be parsed as the year 2042 with today’s cal¬ 
endar date. 


datetime objects also have a number of locale-specific formatting options for systems 
in other countries or languages. For example, the abbreviated month names will be 
different on German or French systems compared with English systems. See 
Table 11-3 for a listing. 

Table 11-3. Locale-specific date formatting 


Type Description 


%a Abbreviated weekday name 

%A Full weekday name 

%b Abbreviated month name 

%B Full month name 

%c Full date and time (e.g., 'Tue 01 May 2012 04:20:57 PM') 

%p Locale equivalent of AM or PM 

%x Locale-appropriate formatted date (e.g., in the United States, May 1,2012 yields '05/01/2012') 

%X Locale-appropriate time (e.g., '04:24:12 PM') 
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11.2 Time Series Basics 

A basic kind of time series object in pandas is a Series indexed by timestamps, which 
is often represented external to pandas as Python strings or datetime objects: 

In [39]: from import datetime 

In [40]: dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), 
datetime(2011, 1, 7), datetime(2011, 1, 8), 
datetime(2011, 1, 10), datetime(2011, 1, 12)] 


In [41]: ts 

= pd.Series(np.random.randn(6), index=dates) 

In [42]: ts 


0ut[42] : 


2011-01-02 

-0.204708 

2011-01-05 

0.478943 

2011-01-07 

-0.519439 

2011-01-08 

-0.555730 

2011-01-10 

1.965781 

2011-01-12 

1.393406 

dtype: float64 


Under the hood, these datetime objects have been put in a Datetimelndex: 

In [43]: ts.index 
0ut[43] : 

Datetimelndex( [' 2011-01-02' , ’2011-01-05', ’2011-01-07', '2011-01-08', 
' 2011 - 01 - 10 ', ' 2011 - 01 - 12 '], 
dtype= 1 datetime64[ns] 1 , freq=None) 

Like other Series, arithmetic operations between differently indexed time series auto¬ 
matically align on the dates: 


In [44]: ts 
0ut[44] : 

+ ts[:: 2] 

2011-01-02 

-0.409415 

2011-01-05 

NaN 

2011-01-07 

-1.038877 

2011-01-08 

NaN 

2011-01-10 

3.931561 

2011-01-12 

NaN 

dtype: float64 


Recall that ts [:: 2] selects every second element in ts. 

pandas stores timestamps using NumPy’s datetime64 data type at the nanosecond 
resolution: 

In [45]: ts.index.dtype 
0ut[45]: dtype( 1 <M8[ns] 1 ) 

Scalar values from a Datetimelndex are pandas Timestamp objects: 
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In [46]: stamp = ts.index[0] 


In [47]: stamp 

0ut[47] : Timestamp( '2011-01-02 00:00:00') 

A Timestamp can be substituted anywhere you would use a datetime object. Addi¬ 
tionally, it can store frequency information (if any) and understands how to do time 
zone conversions and other kinds of manipulations. More on both of these things 
later. 

Indexing, Selection, Subsetting 

Time series behaves like any other pandas.Series when you are indexing and select¬ 
ing data based on label: 

In [48]: stamp = ts.index[2] 

In [49]: ts[stamp] 

0ut[49] : -0.51943871505673811 

As a convenience, you can also pass a string that is interpretable as a date: 

In [50]: ts[' 1/10/2011' ] 

Out[50] : 1.9657805725027142 

In [51]: ts[ '20110110' ] 

0ut[51] : 1.9657805725027142 

For longer time series, a year or only a year and month can be passed to easily select 
slices of data: 

In [52]: longer_ts = pd.Sertes(np.random. randn(1000) , 

....: index=pd.date_range( '1/1/2000' , periods=1000) ) 


longer_ts 


In [53]: 
0ut[53] : 
2000 - 01-01 
2000 - 01-02 
2000-01-03 
2000-01-04 
2000-01-05 
2000-01-06 
2000-01-07 
2000-01-08 
2000-01-09 
2000 - 01-10 

2002-09-17 

2002-09-18 

2002-09-19 

2002-09-20 

2002-09-21 


0.092908 

0.281746 

0.769023 

1.246435 

1.007189 

-1.296221 

0.274992 

0.228913 

1.352917 

0.886429 

-0.139298 

-1.159926 

0.618965 

1.373890 

-0.983505 
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2002-09-22 

2002-09-23 

2002-09-24 

2002-09-25 

2002-09-26 


0.930944 

-0.811676 

-1.830156 

-0.138730 

0.334088 


Freq: D, Length: 1000, dtype: float64 
ts[' 2001' ] 


In [54]: longer 
0ut[54] : 
2001 - 01-01 
2001 - 01-02 
2001-01-03 
2001-01-04 
2001-01-05 
2001-01-06 
2001-01-07 
2001-01-08 
2001-01-09 
2001 - 01-10 


599534 

474071 

151326 

542173 

475496 

106403 

308228 

173185 

564561 

190481 


2001 - 
2001 
2001 - 
2001 - 
2001 - 
2001 - 
2001 
2001 - 
2001 - 
2001 
Freq : 


12-22 

12-23 

12-24 

12-25 

12-26 

12-27 

12-28 

12-29 

12-30 

12-31 


0.000369 

0.900885 

-0.454869 

-0.864547 

1.129120 

0.057874 

-0.433739 

0.092698 

-1.397820 

1.457823 


D, Length: 365, dtype: float64 


Here, the string 1 2001' is interpreted as a year and selects that time period. This also 
works if you specify the month: 


In [55]: longer_ts[' 2001-05' ] 
0ut[55] : 

2001-05-01 -0.622547 

2001-05-02 

0.936289 

2001-05-03 

0.750018 

2001-05-04 

-0.056715 

2001-05-05 

2.300675 

2001-05-06 

0.569497 

2001-05-07 

1.489410 

2001-05-08 

1.264250 

2001-05-09 

-0.761837 

2001-05-10 

-0.331617 

2001-05-22 

0.503699 

2001-05-23 

-1.387874 

2001-05-24 

0.204851 

2001-05-25 

0.603705 

2001-05-26 

0.545680 
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2001-05-27 0.235477 

2001-05-28 0.111835 

2001-05-29 -1.251504 

2001-05-30 -2.949343 

2001-05-31 0.634634 

Freq: D, Length: 31, dtype: float64 

Slicing with datetime objects works as well: 


In [56]: ts[datetime(2011, 1, 7):] 

0ut[56] : 

2011-01-07 -0.519439 

2011-01-08 -0.555730 

2011-01-10 1.965781 

2011-01-12 1.393406 

dtype: float64 

Because most time series data is ordered chronologically, you can slice with time- 
stamps not contained in a time series to perform a range query: 


In [57]: ts 
0ut[57] : 
2011 - 01-02 
2011-01-05 
2011-01-07 
2011-01-08 
2011 - 01-10 
2011 - 01-12 
dtype : 


-0.204708 

0.478943 

-0.519439 

-0.555730 


1.965781 
1.393406 

float64 


In [58]: ts[' 1/6/20111/11/2011' ] 

0ut[58] : 

2011-01-07 -0.519439 

2011-01-08 -0.555730 

2011-01-10 1.965781 

dtype: float64 

As before, you can pass either a string date, datetime, or timestamp. Remember that 
slicing in this manner produces views on the source time series like slicing NumPy 
arrays. This means that no data is copied and modifications on the slice will be reflec¬ 
ted in the original data. 

There is an equivalent instance method, truncate, that slices a Series between two 
dates: 

In [59]: ts.truncate(after= '1/9/2011' ) 

0ut[59] : 

2011-01-02 -0.204708 

2011-01-05 0.478943 

2011-01-07 -0.519439 

2011-01-08 -0.555730 

dtype: float64 
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All of this holds true for DataFrame as well, indexing on its rows: 

In [60]: dates = pd.date_range( '1/1/2000' , periods=100, f req=' kl-WED' ) 

In [61]: long_df = pd.DataFrame(np.random. randn(100, 4), 

....: index=dates, 

coIupins=[ 'Colorado' , 'Texas', 

. . .. : 'New York', 'Ohio']) 

In [62]: long_df. toc[ '5-2001' ] 

0ut[62] : 

Colorado Texas New York Ohio 
2001-05-02 -0.006045 0.490094 -0.277186 -0.707213 

2001-05-09 -0.560107 2.735527 0.927335 1.513906 

2001-05-16 0.538600 1.273768 0.667876 -0.969206 

2001-05-23 1.676091 -0.817649 0.050188 1.951312 

2001-05-30 3.260383 0.963301 1.201206 -1.852001 

Time Series with Duplicate Indices 

In some applications, there may be multiple data observations falling on a particular 
timestamp. Here is an example: 

In [63]: dates = pd.Datetimelndex( [' 1/1/2000 1 , '1/2/2000', '1/2/2000', 

_: ’1/2/2000', '1/3/2000']) 

In [64]: dup_ts = pd.Series(np.arange(5) , index=dates) 

In [65]: dup_ts 
0ut[65] : 

2000 - 01-01 0 
2000 - 01-02 1 
2000 - 01-02 2 
2000-01-02 3 

2000-01-03 4 

dtype: tnt64 

We can tell that the index is not unique by checking its is_unique property: 

In [66]: dup_ts.index.is_untque 
0ut[66]: False 

Indexing into this time series will now either produce scalar values or slices depend¬ 
ing on whether a timestamp is duplicated: 

In [67]: dup_ts[ '1/3/2000' ] # not duplicated 

0ut[67] : 4 

In [68]: dup_ts[ '1/2/2000' ] # duplicated 

0ut[68] : 

2000 - 01-02 1 
2000 - 01-02 2 
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2000-01-02 3 

dtype: Int64 

Suppose you wanted to aggregate the data having non-unique timestamps. One way 
to do this is to use groupby and pass level=0: 

In [69]: grouped = dup_ts.groupby(tevel=0) 

In [70]: grouped,mean( ) 

Out[70] : 

2000 - 01-01 0 
2000 - 01-02 2 
2000-01-03 4 

dtype: tnt64 

In [71]: grouped.count( ) 

0ut[71] : 

2000 - 01-01 1 
2000-01-02 3 

2000-01-03 1 

dtype: tnt64 

11.3 Date Ranges, Frequencies, and Shifting 

Generic time series in pandas are assumed to be irregular; that is, they have no fixed 
frequency. For many applications this is sufficient. However, it’s often desirable to 
work relative to a fixed frequency, such as daily, monthly, or every 15 minutes, even if 
that means introducing missing values into a time series. Fortunately pandas has a 
full suite of standard time series frequencies and tools for resampling, inferring fre¬ 
quencies, and generating fixed-frequency date ranges. For example, you can convert 
the sample time series to be fixed daily frequency by calling resample: 

In [72]: ts 
0ut[72] : 

2011-01-02 -0.204708 

2011-01-05 0.478943 

2011-01-07 -0.519439 

2011-01-08 -0.555730 

2011-01-10 1.965781 

2011-01-12 1.393406 

dtype: float64 

In [73]: resampler = ts.resample(' D' ) 

The string ' D' is interpreted as daily frequency. 

Conversion between frequencies or resampling is a big enough topic to have its own 
section later (Section 11.6, “Resampling and Frequency Conversion,” on page 348). 
Here I’ll show you how to use the base frequencies and multiples thereof. 


11.3 Date Ranges, Frequencies, and Shifting | 327 



Generating Date Ranges 

While I used it previously without explanation, pandas .date_range is responsible for 
generating a Datetimelndex with an indicated length according to a particular 
frequency: 

In [74]: index = pd.date_range( '2012-04-01' , '2012-06-01') 


In [75]: index 
0ut[75] : 


Datetimelndex( [' 2012-04-01' , 

’2012-04-02' , 

'2012-04-03' , 

'2012-04-04 

'2012-04-05' , 

’2012-04-06' , 

’2012-04-07' , 

'2012-04-08 

'2012-04-09' , 

'2012-04-10' , 

'2012-04-11' , 

'2012-04-12 

’2012-04-13' , 

’2012-04-14' , 

'2012-04-15' , 

'2012-04-16 

’2012-04-17' , 

’2012-04-18' , 

’2012-04-19' , 

'2012-04-20 

'2012-04-21' , 

'2012-04-22' , 

'2012-04-23' , 

'2012-04-24 

’2012-04-25' , 

’2012-04-26' , 

’2012-04-27' , 

'2012-04-28 

’2012-04-29' , 

'2012-04-30' , 

’2012-05-01' , 

'2012-05-02 

’2012-05-03' , 

’2012-05-04' , 

'2012-05-05' , 

'2012-05-06 

’2012-05-07' , 

’2012-05-08' , 

’2012-05-09' , 

'2012-05-10 

’2012-05-11' , 

’2012-05-12' , 

'2012-05-13' , 

'2012-05-14 

’2012-05-15' , 

’2012-05-16' , 

'2012-05-17' , 

'2012-05-18 

’2012-05-19' , 

’2012-05-20' , 

’2012-05-21' , 

'2012-05-22 

’2012-05-23' , 

’2012-05-24' , 

’2012-05-25' , 

'2012-05-26 

’2012-05-27' , 

’2012-05-28' , 

’2012-05-29' , 

'2012-05-30 

’2012-05-31' , 

’2012-06-01' ]. 

» 



dtype= 'datetime64[ns]' , freq= 'D' ) 


By default, date_range generates daily timestamps. If you pass only a start or end 
date, you must pass a number of periods to generate: 


In [76]: pd.date_range(start='2012-04-01' , periods=20) 
0ut[76] : 

Datetimelndex([ ’2012-04-01' , ’2012-04-02', ’2012-04-03', 
’2012-04-05', ’2012-04-06', ’2012-04-07', 
’2012-04-09', ’2012-04-10', ’2012-04-11', 
’2012-04-13', ’2012-04-14', ’2012-04-15', 
’2012-04-17', ’2012-04-18', ’2012-04-19', 
dtype= ’datetime64[ns] ’ , freq= ’D’ ) 


'2012-04-04' , 
'2012-04-08' , 
'2012-04-12' , 
'2012-04-16' , 
'2012-04-20'], 


In [77]: pd.date_range(end= '2012-06-01' , periods=20) 
0ut[77] : 

Datetimelndex([ '2012-05-13' , '2012-05-14', '2012-05-15', 
'2012-05-17', ’2012-05-18', ’2012-05-19', 
’2012-05-21', ’2012-05-22', ’2012-05-23', 
’2012-05-25', ’2012-05-26', ’2012-05-27', 
’2012-05-29', ’2012-05-30', ’2012-05-31', 
dtype= 'datetime64[ns]' , freq= 'D' ) 


'2012-05-16' , 
'2012-05-20' , 
'2012-05-24' , 
'2012-05-28' , 
'2012-06-01'], 


The start and end dates define strict boundaries for the generated date index. For 
example, if you wanted a date index containing the last business day of each month, 
you would pass the 1 BM 1 frequency (business end of month; see more complete listing 


328 | Chapter 11: Time Series 





of frequencies in Table 11-4) and only dates falling on or inside the date interval will 
be included: 

In [78]: pd,date_range( '2000-01-01'2000-12-01', freq='BM') 

0ut[78] : 

Datetinelndex([ '2000-01-31' , '2000-02-29', '2000-03-31', '2000-04-28', 
’2000-05-31', ’2000-06-30', ’2000-07-31', '2000-08-31', 
'2000-09-29', ’2000-10-31', ’2000-11-30'], 
dtype= 1 datetime64[ns] 1 , freq= 1 BM' ) 


Table 11-4. Base time series frequencies (not comprehensive) 


1 Alias 

Offset type 

Description 1 

D 

Day 

Calendar daily 

B 

BustnessDay 

Business daily 

H 

Hour 

Hourly 

T or min 

Minute 

Minutely 

S 

Second 

Secondly 

Lor ms 

Mill! 

Millisecond (1/1,000 of 1 second) 

U 

Micro 

Microsecond (1/1,000,000 of 1 second) 

M 

MonthEnd 

Last calendar day of month 

BM 

BusinessMonthEnd 

Last business day (weekday) of month 

MS 

MonthBegin 

First calendar day of month 

BMS 

BusinessMonthBegin 

First weekday of month 

W-MON, W-TUE, ... 

Week 

Weekly on given day of week (MON, 1UE, WED, 1HU, 

FRI, SAL, or SUN) 

WOM-1MON, W0M-2M0N, .. 

. WeekOfMonth 

Generate weekly dates in the first, second, third, or 
fourth week of the month (e.g., WOM-3FRI for the 
third Friday of each month) 

Q-JAN, Q-FEB, . . . 

QuarterEnd 

Quarterly dates anchored on last calendar day of each 
month, for year ending in indicated month (JAN, FEB, 
MAR, APR, MAY, JUN, JUL, AUG, SEP, OCT, NOV, or DEC) 

BQ-JAN, BQ-FEB, ... 

BusinessQuarterEnd 

Quarterly dates anchored on last weekday day of each 
month, for year ending in indicated month 

QS-JAN, QS-FEB, ... 

QuarterBegin 

Quarterly dates anchored on first calendar day of each 
month, for year ending in indicated month 

BQS-3AN, BQS-FEB, ... 

BusinessQuarterBegin 

Quarterly dates anchored on first weekday day of each 
month, for year ending in indicated month 

A-JAN, A-FEB, . . . 

YearEnd 

Annual dates anchored on last calendar day of given 
month (JAN, FEB, MAR, APR, MAY, JUN, JUL, AUG, SEP, 
OCL, NOV, or DEC) 

BA-JAN, BA-FEB, ... 

BusinessYearEnd 

Annual dates anchored on last weekday of given 
month 

AS-JAN, AS-FEB, . . . 

YearBegin 

Annual dates anchored on first day of given month 

BAS-JAN, BAS-FEB, ... 

BusinessYearBegin 

Annual dates anchored on first weekday of given 
month 
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date_range by default preserves the time (if any) of the start or end timestamp: 

In [79]: pd.date_range( '2012-05-02 12:56:31', periods=5) 

0ut[79] : 

Datetimelndex( [' 2012-05-02 12:56:31', '2012-05-03 12:56:31', 

'2012-05-04 12:56:31', '2012-05-05 12:56:31', 

'2012-05-06 12:56:31' ], 
dtype= 'datetime64[ns]' , freq='D' ) 

Sometimes you will have start or end dates with time information but want to gener¬ 
ate a set of timestamps normalized to midnight as a convention. To do this, there is a 
normalize option: 

In [80]: pd.date_range( '2012-05-02 12:56:31', pertods=5, nomatize=True) 

Out[80] : 

Datetimelndex([ '2012-05-02' , '2012-05-03', ’2012-05-04', '2012-05-05', 
'2012-05-06'], 

dtype='datetime64[ns] ' , freq='D' ) 

Frequencies and Date Offsets 

Frequencies in pandas are composed of a base frequency and a multiplier. Base fre¬ 
quencies are typically referred to by a string alias, like ' M' for monthly or 1 H' for 
hourly. For each base frequency, there is an object defined generally referred to as a 
date offset. For example, hourly frequency can be represented with the Hour class: 

In [81]: from import Hour, Minute 

In [82]: hour = Hour( ) 

In [83]: hour 
0ut[83]: <Hour> 

You can define a multiple of an offset by passing an integer: 

In [84]: four_hours = Hour(4) 

In [85]: four_hours 
0ut[85] : <4 * Hours> 

In most applications, you would never need to explicitly create one of these objects, 
instead using a string alias like ' H' or ' 4H'. Putting an integer before the base fre¬ 
quency creates a multiple: 

In [86]: pd,date_range( 1 2000-01-01' , '2000-01-03 23:59', freq='4h‘) 

0ut[86] : 

Datetimelndex([ '2000-01-01 00:00:00', '2000-01-01 04:00:00', 

'2000-01-01 08:00:00', '2000-01-01 12:00:00', 

'2000-01-01 16:00:00', '2000-01-01 20:00:00', 

'2000-01-02 00:00:00', '2000-01-02 04:00:00', 

'2000-01-02 08:00:00', '2000-01-02 12:00:00', 

'2000-01-02 16:00:00', '2000-01-02 20:00:00', 
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'2000-01-03 00:00:00', '2000-01-03 04:00:00', 

'2000-01-03 08:00:00', '2000-01-03 12:00:00', 

'2000-01-03 16:00:00', '2000-01-03 20:00:00'], 
dtype='datetime64[ns] 1 , freq=’4H' ) 

Many offsets can be combined together by addition: 

In [87]: Hour(2) + Minute(30) 

0ut[87]: <150 * Minutes> 

Similarly, you can pass frequency strings, like 'lh30min', that will effectively be 
parsed to the same expression: 

In [88]: pd.date_range( '2000-01-01' , periods=10, freq=' lh30mln' ) 

0ut[88] : 

Datetimelndex([ '2000-01-01 00:00:00', '2000-01-01 01:30:00', 

'2000-01-01 03:00:00', '2000-01-01 04:30:00', 

'2000-01-01 06:00:00', '2000-01-01 07:30:00', 

'2000-01-01 09:00:00', '2000-01-01 10:30:00', 

'2000-01-01 12:00:00', '2000-01-01 13:30:00'], 
dtype= 'datetime64[ns]' , freq='90T' ) 

Some frequencies describe points in time that are not evenly spaced. For example, ' M 1 
(calendar month end) and ' BM' (last business/weekday of month) depend on the 
number of days in a month and, in the latter case, whether the month ends on a 
weekend or not. We refer to these as anchored offsets. 

Refer back to Table 11-4 for a listing of frequency codes and date offset classes avail¬ 
able in pandas. 



Users can define their own custom frequency classes to provide 
date logic not available in pandas, though the full details of that are 
outside the scope of this book. 


Week of month dates 

One useful frequency class is “week of month,” starting with WOM. This enables you to 
get dates like the third Friday of each month: 

In [89]: rng = pd.date_range( '2012-01-01' , '2012-09-01', freq= 1 W0M-3FRI' ) 


In [90]: Itst(rng) 

Out[90] : 

[Timestamp! ’2012-01-20 00:00:00' , 
Timestamp! '2012-02-17 00:00:00', 
Timestamp! '2012-03-16 00:00:00', 
Timestamp(' 2012-04-20 00:00:00', 
Timestamp! '2012-05-18 00:00:00', 
Timestamp! '2012-06-15 00:00:00', 


freq=' W0M-3FRI' ), 
freq=' W0M-3FRI' ), 
freq=' W0M-3FRI' ), 
freq=' W0M-3FRI' ), 
freq=' W0M-3FRI' ), 
freq=' W0M-3FRI' ), 
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Timestamp ( '2012-07-20 00:00:00', freq=' W0M-3FRI' ), 

Ti.mestamp( '2012-08-17 00:00:00', freq='W0M-3FRI' )] 

Shifting (Leading and Lagging) Data 

“Shifting” refers to moving data backward and forward through time. Both Series and 
DataFrame have a shift method for doing naive shifts forward or backward, leaving 
the index unmodified: 

In [91]: ts = pd.Series(np.random. randn(4) , 

....: tndex=pd.date_range( '1/1/2000' , periods=4, freq='M')) 


In [92]: ts 


0ut[92] : 


2000-01-31 

-0.066748 

2000-02-29 

0.838639 

2000-03-31 

-0.117388 

2000-04-30 

-0.517795 

Freq: M, dtype: ftoat64 

In [93]: ts, 

.shift(2) 

0ut[93] : 


2000-01-31 

NaN 

2000-02-29 

NaN 

2000-03-31 

-0.066748 

2000-04-30 

0.838639 

Freq: M, dtype: ftoat64 

In [94]: ts, 

.shtft(-2) 

0ut[94] : 


2000-01-31 

-0.117388 

2000-02-29 

-0.517795 

2000-03-31 

NaN 

2000-04-30 

NaN 

Freq: M, dtype: ftoat64 


When we shift like this, missing data is introduced either at the start or the end of the 
time series. 

A common use of shift is computing percent changes in a time series or multiple 
time series as DataFrame columns. This is expressed as: 

ts / ts.shift(i) - l 

Because naive shifts leave the index unmodified, some data is discarded. Thus if the 
frequency is known, it can be passed to shift to advance the timestamps instead of 
simply the data: 

In [95]: ts.shift(2, freq='M') 

0ut[95] : 

2000-03-31 -0.066748 

2000-04-30 0.838639 
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2000-05-31 -0.117388 

2000-06-30 -0.517795 

Freq: M, dtype: float64 

Other frequencies can be passed, too, giving you some flexibility in how to lead and 
lag the data: 

In [96]: ts.shtft(3, freq='D') 

0ut[96] : 

2000-02-03 -0.066748 

2000-03-03 0.838639 

2000-04-03 -0.117388 

2000-05-03 -0.517795 

dtype: float64 

In [97]: ts.shtft(l, freq='90T') 

0ut[97] : 

2000-01-31 01:30:00 -0.066748 

2000-02-29 01:30:00 0.838639 

2000-03-31 01:30:00 -0.117388 

2000-04-30 01:30:00 -0.517795 

Freq: M, dtype: ftoat64 

The T here stands for minutes. 

Shifting dates with offsets 

The pandas date offsets can also be used with datetime or Timestamp objects: 

In [98]: from import Day, MonthEnd 

In [99]: now = datetime(2011, 11, 17) 

In [100]: now + 3 * Day() 

Out[100] : Timestamp( '2011-11-20 00:00:00') 

If you add an anchored offset like MonthEnd, the first increment will “roll forward” a 
date to the next date according to the frequency rule: 

In [101]: now + MonthEnd!) 

Out[101] : Timestamp! '2011-11-30 00:00:00') 

In [102]: now + MonthEnd(2) 

Out[102] : Timestamp! '2011-12-31 00:00:00') 

Anchored offsets can explicitly “roll” dates forward or backward by simply using their 
rollforward and rollback methods, respectively: 

In [103]: offset = MonthEnd!) 

In [104]: offset.rollforward(now) 

Out[104] : Timestamp! '2011-11-30 00:00:00') 
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In [105]: offset.rollback(now) 

Out[105] : TimestampC 2011-10-31 00:00:00') 

A creative use of date offsets is to use these methods with groupby: 

In [106]: ts = pd.Series(np.random.randn(20) , 

.: lndex=pd.date_range( 1 1/15/2000', periods=20, freq='4d')) 


In [107]: ts 
Out[107] : 

2000-01-15 -0.116696 
2000-01-19 2.389645 
2000-01-23 -0.932454 
2000-01-27 -0.229331 
2000-01-31 -1.140330 
2000-02-04 0.439920 
2000-02-08 -0.823758 
2000-02-12 -0.520930 


2000-02-16 0.350282 

2000-02-20 0.204395 

2000-02-24 0.133445 

2000-02-28 0.327905 

2000-03-03 0.072153 

2000-03-07 0.131678 

2000-03-11 -1.297459 

2000-03-15 0.997747 

2000-03-19 0.870955 

2000-03-23 -0.991253 

2000-03-27 0.151699 

2000-03-31 1.266151 

Freq: 4D, dtype: float64 


In [108]: ts.groupby(offset.roltforward),mean( ) 

Out[108] : 

2000-01-31 -0.005833 

2000-02-29 0.015894 

2000-03-31 0.150209 

dtype: float64 

Of course, an easier and faster way to do this is using resample (we’ll discuss this in 
much more depth in Section 11.6, “Resampling and Frequency Conversion,” on page 
348): 


In [109]: ts. 
Out[109] : 

, resample( 'M') .meanQ 

2000-01-31 

-0.005833 

2000-02-29 

0.015894 

2000-03-31 

0.150209 

Freq: M, dtype: float64 
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11.4 Time Zone Handling 

Working with time zones is generally considered one of the most unpleasant parts of 
time series manipulation. As a result, many time series users choose to work with 
time series in coordinated universal time or UTC, which is the successor to Greenwich 
Mean Time and is the current international standard. Time zones are expressed as 
offsets from UTC; for example, New York is four hours behind UTC during daylight 
saving time and five hours behind the rest of the year. 

In Python, time zone information comes from the third-party pytz library (installa¬ 
ble with pip or conda), which exposes the Olson database, a compilation of world 
time zone information. This is especially important for historical data because the 
daylight saving time (DST) transition dates (and even UTC offsets) have been 
changed numerous times depending on the whims of local governments. In the Uni¬ 
ted States, the DST transition times have been changed many times since 1900! 

For detailed information about the pytz library, you’ll need to look at that library’s 
documentation. As far as this book is concerned, pandas wraps pytz’s functionality so 
you can ignore its API outside of the time zone names. Time zone names can be 
found interactively and in the docs: 

In [110]: import 

In [111]: pytz.common_timezones[-5: ] 

Out[lll]: [' US/Eastern ' , 'US/Hawaii', 1 US/Mountain' , ' US/Paciflc' , 'UTC'] 

To get a time zone object from pytz, use pytz. timezone: 

In [112]: tz = pytz.tlmezone( 'Amertca/New_York' ) 

In [113]: tz 

0ut[113] : <DstTzInfo 'America/New_York' LMT-1 day, 19:04:00 STD> 

Methods in pandas will accept either time zone names or these objects. 

Time Zone Localization and Conversion 

By default, time series in pandas are time zone naive. For example, consider the fol¬ 
lowing time series: 

In [114]: rng = pd.date_range( '3/9/2012 9:30', periods=6, freq='D') 

In [115]: ts = pd.Series(np.random.randn(ten(rng) ), index=rng) 

In [116]: ts 
0ut[116] : 

2012-03-09 09:30:00 -0.202469 

2012-03-10 09:30:00 0.050718 

2012-03-11 09:30:00 0.639869 

2012-03-12 09:30:00 0.597594 
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2012-03-13 09:30:00 -0.797246 
2012-03-14 09:30:00 0.472879 
Freq: D, dtype: float64 

The index’s tz field is None: 


In [117]: pri.nt(ts. index, tz) 
None 


Date ranges can be generated with a time zone set: 

In [118]: pd.date_range( '3/9/2012 9:30', periods=10, freq='D', tz='UTC') 
0ut[118] : 

Datetimelndex([ '2012-03-09 09:30:00+00:00', '2012-03-10 09:30:00+00:00’, 
’2012-03-11 09:30:00+00:00’, '2012-03-12 09:30:00+00:00’, 
’2012-03-13 09:30:00+00:00', '2012-03-14 09:30:00+00:00’, 
’2012-03-15 09:30:00+00:00', '2012-03-16 09:30:00+00:00’, 
’2012-03-17 09:30:00+00:00', '2012-03-18 09:30:00+00:00’], 
dtype=’datetine64[ns, UTC]', freq=’D’) 

Conversion from naive to localized is handled by the tzjlocalize method: 


In [119]: ts 
0ut[119]: 

2012-03-09 09:30:00 -0.202469 
2012-03-10 09:30:00 0.050718 
2012-03-11 09:30:00 0.639869 
2012-03-12 09:30:00 0.597594 
2012-03-13 09:30:00 -0.797246 
2012-03-14 09:30:00 0.472879 
Freq: D, dtype: ftoat64 


In [120]: ts_utc = ts.tz_Iocalize( ’UTC’ ) 


In [121]: ts_utc 
0ut[121]: 

2012-03-09 09:30:00+00:00 
2012-03-10 09:30:00+00:00 
2012-03-11 09:30:00+00:00 
2012-03-12 09:30:00+00:00 
2012-03-13 09:30:00+00:00 
2012-03-14 09:30:00+00:00 
Freq: D, dtype: ftoat64 


-0.202469 

0.050718 

0.639869 

0.597594 

-0.797246 

0.472879 


In [122]: ts_utc.Index 
0ut[122]: 

Datetimelndex([ ’2012-03-09 09:30:00+00:00', '2012-03-10 09:30:00+00:00', 
'2012-03-11 09:30:00+00:00', '2012-03-12 09:30:00+00:00’, 
’2012-03-13 09:30:00+00:00’, ’2012-03-14 09:30:00+00:00’], 
dtype='datetime64[ns, UTC]’, freq='D') 


Once a time series has been localized to a particular time zone, it can be converted to 
another time zone with tz convert: 


336 | Chapter 11: Time Series 



In [123]: ts_utc.tz_convert( 'America/New_York' ) 

Out[ 123] : 

2012-03-09 04:30:00-05:00 -0.202469 

2012-03-10 04:30:00-05:00 0.050718 

2012-03-11 05:30:00-04:00 0.639869 

2012-03-12 05:30:00-04:00 0.597594 

2012-03-13 05:30:00-04:00 -0.797246 

2012-03-14 05:30:00-04:00 0.472879 

Freq: D, dtype: float64 

In the case of the preceding time series, which straddles a DST transition in the Amer 
ica/New_York time zone, we could localize to EST and convert to, say, UTC or Berlin 
time: 

In [124]: ts_eastern = ts.tz_localtze( 'Amertca/New_York' ) 


In [125]: ts_eastern.tz_convert( 'UTC' ) 
0ut[125]: 

2012-03-09 14:30:00+00:00 -0.202469 
2012-03-10 14:30:00+00:00 0.050718 
2012-03-11 13:30:00+00:00 0.639869 
2012-03-12 13:30:00+00:00 0.597594 
2012-03-13 13:30:00+00:00 -0.797246 
2012-03-14 13:30:00+00:00 0.472879 
Freq: D, dtype: ftoat64 


In [126]: ts_eastern.tz_convert(' Europe/Berltn' ) 

0ut[126]: 

2012-03-09 15:30:00+01:00 -0.202469 

2012-03-10 15:30:00+01:00 0.050718 

2012-03-11 14:30:00+01:00 0.639869 

2012-03-12 14:30:00+01:00 0.597594 

2012-03-13 14:30:00+01:00 -0.797246 

2012-03-14 14:30:00+01:00 0.472879 

Freq: D, dtype: ftoat64 

tz_localize and tz_convert are also instance methods on Datetimelndex: 

In [127]: ts.index.tz_localtze( 'Asta/Shanghai' ) 

0ut[127]: 

Datetimelndex([ '2012-03-09 09:30:00+08:00', '2012-03-10 09:30:00+08:00’, 
’2012-03-11 09:30:00+08:00', '2012-03-12 09:30:00+08:00’, 
’2012-03-13 09:30:00+08:00', ’2012-03-14 09:30:00+08:00’], 
dtype='datetime64[ns, Asia/Shanghai]', freq='D') 



Localizing naive timestamps also checks for ambiguous or non¬ 
existent times around daylight saving time transitions. 
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Operations with Time Zone-Aware Timestamp Objects 

Similar to time series and date ranges, individual Timestamp objects similarly can be 
localized from naive to time zone-aware and converted from one time zone to 
another: 

In [128]: stamp = pd.TimestampC '2011-03-12 04:00') 

In [129]: stamp_utc = stamp.tz_Iocalize(' utc' ) 

In [130]: stamp_utc.tz_convert( 'America/New_York' ) 

Out[130]: TimestampC 2011-03-11 23:00:00-0500', tz='America/New_York' ) 

You can also pass a time zone when creating the Timestamp: 

In [131]: stamp_moscow = pd.TimestampC 1 2011-03-12 04:00', tz=' Europe/Moscow' ) 

In [132]: stamp_moscow 

0ut[132]: Timestamp(' 2011-03-12 04:00:00+0300', tz=' Europe/Moscow' ) 

Time zone-aware Timestamp objects internally store a UTC timestamp value as nano¬ 
seconds since the Unix epoch (January 1, 1970); this UTC value is invariant between 
time zone conversions: 

In [133]: stamp_utc.value 
0ut[133]: 1299902400000000000 

In [134]: stamp_utc,tz_convert( 'America/New_York' ) .value 
0ut[134]: 1299902400000000000 

When performing time arithmetic using pandas’s DateOffset objects, pandas 
respects daylight saving time transitions where possible. Here we construct time- 
stamps that occur right before DST transitions (forward and backward). First, 30 
minutes before transitioning to DST: 

In [135]: from import Hour 

In [136]: stamp = pd.TimestampC '2012-03-12 01:30', tz=' US/Eastern' ) 

In [137]: stamp 

0ut[137]: TimestampC '2012-03-12 01:30:00-0400', tz= 'US/Eastern' ) 

In [138]: stamp + Hour() 

0ut[138]: TimestampC '2012-03-12 02:30:00-0400', tz= 'US/Eastern' ) 

Then, 90 minutes before transitioning out of DST: 

In [139]: stamp = pd.TimestampC '2012-11-04 00:30', tz=' US/Eastern' ) 

In [140]: stamp 

Out[140]: TimestampC '2012-11-04 00:30:00-0400', tz= 'US/Eastern' ) 
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In [141]: stamp + 2 * Hour() 

0ut[141]: TimestampC 2012-11-04 01:30:00-0500', tz= 'US/Eastern' ) 

Operations Between Different Time Zones 

If two time series with different time zones are combined, the result will be UTC. 
Since the timestamps are stored under the hood in UTC, this is a straightforward 
operation and requires no conversion to happen: 

In [142]: rng = pd.date_range( '3/7/2012 9:30', periods=10, freq='B') 

In [143]: ts = pd.Series(np.random.randn(len(rng) ), index=rng) 


In [144]: ts 


0ut[144] : 

2012-03-07 

2012-03-08 

2012-03-09 

2012-03-12 

2012-03-13 

2012-03-14 

2012-03-15 

2012-03-16 


09:30:00 

09:30:00 

09:30:00 

09:30:00 

09:30:00 

09:30:00 

09:30:00 

09:30:00 

09:30:00 

09:30:00 


0.522356 

-0.546348 

-0.733537 

1.302736 

0.022199 

0.364287 

-0.922839 

0.312656 

-1.128497 

-0.333488 


2012-03-19 
2012-03-20 
Freq: B, dtype: float64 


In [145]: tsl = ts[ :7]. tz_localtze( 'Europe/London' ) 


In [146]: ts2 = tsl[2 :]. tz_convert(' Europe/Moscow' ) 


In [147]: result = tsl + ts2 

In [148]: result.Index 
0ut[148] : 

Datetimelndex([ '2012-03-07 09:30:00+00:00', '2012-03-08 09:30:00+00:00’, 
’2012-03-09 09:30:00+00:00', '2012-03-12 09:30:00+00:00', 
'2012-03-13 09:30:00+00:00', '2012-03-14 09:30:00+00:00', 
'2012-03-15 09:30:00+00:00'], 
dtype='datetime64[ns, UTC]', freq='B') 

11.5 Periods and Period Arithmetic 

Periods represent timespans, like days, months, quarters, or years. The Period class 
represents this data type, requiring a string or integer and a frequency from 
Table 11-4: 

In [149]: p = pd. Period(2007, freq='A-DEC' ) 


In [150]: p 

Out[150]: Period( '2007' , 'A-DEC') 
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In this case, the Period object represents the full timespan from January 1, 2007, to 
December 31, 2007, inclusive. Conveniently, adding and subtracting integers from 
periods has the effect of shifting by their frequency: 

In [151]: p + 5 

0ut[151] : Period( ' 2012' , ’A-DEC) 

In [152]: p - 2 

0ut[152] : Period( '2005' , 'A-DEC') 

If two periods have the same frequency, their difference is the number of units 
between them: 

In [153]: pd.Period(' 2014' , freq='A-DEC' ) - p 
0ut[153] : 7 

Regular ranges of periods can be constructed with the period_range function: 

In [154]: rng = pd.perlod_range(' 2000-01-01' , '2000-06-30', freq='H') 

In [155]: rng 

0ut[155] : PeriodIndex( [' 2000-01' , '2000-02', '2000-03', '2000-04', ’2000-05', '20 
00-06'], dtype=' period[M]' , freq='M') 

The Periodlndex class stores a sequence of periods and can serve as an axis index in 
any pandas data structure: 

In [156]: pd.Series(np.random.randn(6), index=rng) 

0ut[156] : 

2000-01 -0.514551 

2000-02 -0.559782 

2000-03 -0.783408 

2000-04 -1.797685 

2000-05 -0.172670 

2000-06 0.680215 

Freq: M, dtype: ftoat64 

If you have an array of strings, you can also use the Periodlndex class: 

In [157]: values = [ 1 2001Q3' , '2002Q2', '2003Q1'] 

In [158]: index = pd.PeriodIndex(values, freq= 'Q-DEC' ) 

In [159]: index 

0ut[159] : PeriodIndex( [ '2001Q3' , '2002Q2', ’2003Q1’], dtype=’ period[Q-DEC] 1 , freq 
= 1 Q-DEC 1 ) 

Period Frequency Conversion 

Periods and Periodlndex objects can be converted to another frequency with their 
asf req method. As an example, suppose we had an annual period and wanted to 
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convert it into a monthly period either at the start or end of the year. This is fairly 
straightforward: 

In [160]: p = pd.Pertod( '2007' , freq='A-DEC ) 

In [161]: p 

0ut[161] : Period( ’ 2007' , ’A-DEC) 

In [162]: p.asfreq( 1 M 1 , how='start') 

0ut[162] : Period( '2007-01' , 'M' ) 

In [163]: p.asfreq(' M' , how='end') 

0ut[163] : Period( '2007-12' , 'M’ ) 

You can think of Period( '2007', 'A-DEC' ) as being a sort of cursor pointing to a 
span of time, subdivided by monthly periods. See Figure 11-1 for an illustration of 
this. For a fiscal year ending on a month other than December, the corresponding 
monthly subperiods are different: 

In [164]: p = pd.Pertod( '2007' , freq= 'A-JUN' ) 

In [165]: p 

0ut[165] : Period( '2007' , ' A-JUN' ) 

In [166]: p.asfreq(' M' , 'start') 

0ut[166] : Period( '2006-07' , 'M' ) 

In [167]: p.asfreq(' M' , 'end') 

0ut[167] : Period( '2007-06' , 'M' ) 

Period('2011-06','M') 

Start J ‘ End 


i 1 


Jan 

Feb 

Mar 

Apr 

May 

Jun 

Jul 

Aug 

Sep 

Oct 

Nov 

Dec 


Period('201T, A-DEC') 

Figure 11-1. Period frequency conversion illustration 

When you are converting from high to low frequency, pandas determines the super¬ 
period depending on where the subperiod “belongs.” For example, in A-DUN fre¬ 
quency, the month Aug - 2007 is actually part of the 2008 period: 

In [168]: p = pd.Period( 'Aug-2007' , 'M' ) 

In [169]: p.asfreq( 'A-JUN' ) 

0ut[169]: Period( '2008' , 'A-JUN' ) 
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Whole Periodlndex objects or time series can be similarly converted with the same 
semantics: 

In [170]: rng = pd.perlod_range(' 2006' , '2009', freq=' A-DEC 1 ) 

In [171]: ts = pd.Series(np.random.randn(ten(rng) ), index=rng) 

In [172]: ts 
0ut[172] : 

2006 1.607578 

2007 0.200381 

2008 -0.834068 

2009 -0.302988 

Freq: A-DEC, dtype: float64 

In [173]: ts.asfreq(' M' , how='start') 

0ut[173] : 

2006- 01 1.607578 

2007- 01 0.200381 

2008- 01 -0.834068 

2009- 01 -0.302988 

Freq: M, dtype: ftoat64 

Here, the annual periods are replaced with monthly periods corresponding to the first 
month falling within each annual period. If we instead wanted the last business day of 
each year, we can use the ' B 1 frequency and indicate that we want the end of the 
period: 

In [174]: ts.asfreq(' B' , how='end') 

0ut[174] : 

2006- 12-29 1.607578 

2007- 12-31 0.200381 

2008- 12-31 -0.834068 

2009- 12-31 -0.302988 

Freq: B, dtype: ftoat64 

Quarterly Period Frequencies 

Quarterly data is standard in accounting, finance, and other fields. Much quarterly 
data is reported relative to a fiscal year end, typically the last calendar or business day 
of one of the 12 months of the year. Thus, the period 2012Q4 has a different meaning 
depending on fiscal year end. pandas supports all 12 possible quarterly frequencies as 
Q-JAN through Q-DEC: 

In [175]: p = pd.Period( '2012Q4' , freq='Q-JAN' ) 

In [176]: p 

0ut[176] : Period( '2012Q4' , 'Q-JAN') 
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In the case of fiscal year ending in January, 2012Q4 runs from November through Jan¬ 
uary, which you can check by converting to daily frequency. See Figure 11-2 for an 
illustration. 



Figure 11-2. Different quarterly frequency conventions 


In [177]: p.asfreq(' D' , 'start') 

0ut[177] : Period( '2011-11-01' , 'D') 

In [178]: p.asfreq( 'D 1 , 'end') 

0ut[178] : Period( '2012-01-31' , 'D') 

Thus, it’s possible to do easy period arithmetic; for example, to get the timestamp at 4 
PM on the second-to-last business day of the quarter, you could do: 

In [179]: p4pm = (p.asfreq(' B' , ' e' ) - 1) .asfreq(' T' , 's') + 16 * 60 
In [180]: p4pn 

Out[180] : Period( '2012-01-30 16:00' , 'T' ) 

In [181]: p4pm.to_timestamp() 

0ut[181] : TlmestampC 2012-01-30 16:00:00') 

You can generate quarterly ranges using period_range. Arithmetic is identical, too: 

In [182]: rng = pd.perlod_range( 1 2011Q3’ , ’2012Q4’, freq='Q-JAN’ ) 

In [183]: ts = pd.Series(np.arange(ten(rng)), index=rng) 

In [184]: ts 
0ut[184] : 

2011Q3 0 

2011Q4 1 

2012Q1 2 

2012Q2 3 

2012Q3 4 

2012Q4 5 

Freq: Q-JAN, dtype: int64 

In [185]: new_rng = (rng.asfreq(' B' , 'e' ) - 1). asfreq(' T' , ’s') + 16 * 60 
In [186]: ts.index = new_rng.to_tinestanp() 
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In [187]: ts 
0ut[187] : 

2010-10-28 16:00:00 0 

2011-01-28 16:00:00 1 

2011-04-28 16:00:00 2 

2011-07-28 16:00:00 3 

2011- 10-28 16:00:00 4 

2012- 01-30 16:00:00 5 

dtype: i.nt64 

Converting Timestamps to Periods (and Back) 

Series and DataFrame objects indexed by timestamps can be converted to periods 
with the to_period method: 

In [188]: rng = pd.date_range( '2000-01-01' , periods=3, freq='M') 

In [189]: ts = pd.Series(np.random.randn(3) , index=rng) 

In [190]: ts 
Out[190] : 

2000-01-31 1.663261 

2000-02-29 -0.996206 

2000-03-31 1.521760 

Freq: M, dtype: float64 

In [191]: pts = ts.to_period() 

In [192]: pts 
0ut[192] : 

2000-01 1.663261 

2000-02 -0.996206 

2000-03 1.521760 

Freq: M, dtype: float64 

Since periods refer to non-overlapping timespans, a timestamp can only belong to a 
single period for a given frequency. While the frequency of the new Periodlndex is 
inferred from the timestamps by default, you can specify any frequency you want. 
There is also no problem with having duplicate periods in the result: 

In [193]: rng = pd.date_range( '1/29/2000' , periods=6, freq=’D') 

In [194]: ts2 = pd.Series(np.random.randn(6) , lndex=rng) 

In [195]: ts2 
0ut[195] : 

2000-01-29 0.244175 

2000-01-30 0.423331 

2000-01-31 -0.654040 

2000-02-01 2.089154 

2000-02-02 -0.060220 
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2000-02-03 -0.167933 

Freq: D, dtype: float64 


In [196]: ts2.to_perlod( 'M ' ) 
0ut[196] : 

2000-01 0.244175 
2000-01 0.423331 
2000-01 -0.654040 
2000-02 2.089154 
2000-02 -0.060220 
2000-02 -0.167933 


Freq: M, dtype: float64 


To convert back to timestamps, use to_timestamp: 


In [197]: pts = ts2.to_perlod( ) 


In [198]: pts 
0ut[198] : 

2000-01-29 0.244175 

2000-01-30 0.423331 

2000-01-31 -0.654040 

2000-02-01 2.089154 

2000-02-02 -0.060220 

2000-02-03 -0.167933 

Freq: D, dtype: ftoat64 


In [199]: pts.to_timestanp(how= 'end 1 ) 
0ut[199]: 

2000-01-29 0.244175 
2000-01-30 0.423331 
2000-01-31 -0.654040 
2000-02-01 2.089154 


2000-02-02 -0.060220 


2000-02-03 -0.167933 

Freq: D, dtype: ftoat64 


Creating a Periodlndexfrom Arrays 

Fixed frequency datasets are sometimes stored with timespan information spread 
across multiple columns. For example, in this macroeconomic dataset, the year and 
quarter are in different columns: 

In [200]: data = pd.read_csv( 'exanples/nacrodata.csv' ) 


In [201]: data.head(5) 
Out[201] : 



year 

quarter 

reatgdp 

realcons 

realinv 

realgovt 

reatdpl 

cpt 

0 

1959.0 

1.0 

2710.349 

1707.4 

286.898 

470.045 

1886.9 

28.98 

1 

1959.0 

2.0 

2778.801 

1733.7 

310.859 

481.301 

1919.7 

29.15 

2 

1959.0 

3.0 

2775.488 

1751.8 

289.226 

491.260 

1916.4 

29.35 

3 

1959.0 

4.0 

2785.204 

1753.7 

299.356 

484.052 

1931.3 

29.37 
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4 


1955.5 


29.54 


1960.0 

1.0 

2847.699 1770.5 331.722 

ml 

tbilrate 

unemp 

pop 

infl 

realint 

139.7 

2.82 

5.8 

177.146 

0.00 

0.00 

141.7 

3.08 

5.1 

177.830 

2.34 

0.74 

140.5 

3.82 

5.3 

178.657 

2.74 

1.09 

140.0 

4.33 

5.6 

179.386 

0.27 

4.06 

139.6 

3.50 

5.2 

180.007 

2.31 

1.19 

[202]: 

data.year 






Out [ 202 ]: 

0 1959.0 

1 1959.0 

2 1959.0 

3 1959.0 

4 1960.0 

5 1960.0 

6 1960.0 

7 1960.0 

8 1961.0 

9 1961.0 

193 2007.0 

194 2007.0 

195 2007.0 

196 2008.0 

197 2008.0 

198 2008.0 

199 2008.0 

200 2009.0 

201 2009.0 

202 2009.0 

Name: year. Length: 203, dtype: float64 

In [203]: data.quarter 
Out[203] : 


0 

1.0 

1 

2.0 

2 

3.0 

3 

4.0 

4 

1.0 

5 

2.0 

6 

3.0 

7 

4.0 

8 

1.0 

9 

2.0 

193 

2.0 

194 

3.0 

195 

4.0 

196 

1.0 

197 

2.0 

198 

3.0 
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199 4.0 

200 1.0 

201 2.0 
202 3.0 

Name: quarter. Length: 203, dtype: float64 

By passing these arrays to Periodlndex with a frequency, you can combine them to 
form an index for the DataFrame: 

In [204]: Index = pd.PeriodIndex(year=data.year, quarter=data.quarter, 

.: freq='Q-DEC') 


In [205]: Index 
Out[205] : 

PeriodIndex( [ 1 1959Q1' , '1959Q2' , ' 1959Q3' , '1959Q4', '1960Q1', '1960Q2', 

'1960Q3', '1960Q4', '1961Q1', '1961Q2', 

'2007Q2', '2007Q3', '2007Q4', ’2008Q1', '2008Q2', '2008Q3', 

'2008Q4', '2009Q1', '2009Q2', ’2009Q3'], 
dtype= 1 perlod[Q-DEC] 1 , length=203, freq='Q-DEC' ) 


In [206]: data.index = index 


In [207]: data.infl 
Out[207] : 


1959Q1 0.00 
1959Q2 2.34 
1959Q3 2.74 
1959Q4 0.27 
1960Q1 2.31 
1960Q2 0.14 
1960Q3 2.70 
1960Q4 1.21 
1961Q1 -0.40 
1961Q2 1.47 


2007Q2 2.75 
2007Q3 3.45 
2007Q4 6.38 
2008Q1 2.82 
2008Q2 8.53 
2008Q3 -3.16 
2008Q4 -8.79 
2009Q1 0.94 
2009Q2 3.37 
2009Q3 3.56 

Freq: Q-DEC, Name: infl. Length: 203, dtype: float64 
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11.6 Resampling and Frequency Conversion 

Resampling refers to the process of converting a time series from one frequency to 
another. Aggregating higher frequency data to lower frequency is called downsam¬ 
pling, while converting lower frequency to higher frequency is called upsampling. Not 
all resampling falls into either of these categories; for example, converting W-WED 
(weekly on Wednesday) to W-FRI is neither upsampling nor downsampling. 

pandas objects are equipped with a resample method, which is the workhorse func¬ 
tion for all frequency conversion, resample has a similar API to groupby; you call 
resample to group the data, then call an aggregation function: 

In [208]: rng = pd.date_range( '2000-01-01' , periods=100, freq='D') 

In [209]: ts = pd.Serles(np.random.randn(len(rng) ), index=rng) 


In [210]: ts 
Out[210] : 

2000-01-01 0.631634 
2000-01-02 -1.594313 
2000-01-03 -1.519937 
2000-01-04 1.108752 
2000-01-05 1.255853 
2000-01-06 -0.024330 


2000-01-07 

2000-01-08 

2000-01-09 

2000 - 01-10 


-2.047939 

-0.272657 

-1.692615 

1.423830 


2000-03-31 -0.007852 

2000-04-01 -1.638806 

2000-04-02 1.401227 

2000-04-03 1.758539 

2000-04-04 0.628932 

2000-04-05 -0.423776 

2000-04-06 0.789740 

2000-04-07 0.937568 

2000-04-08 -2.253294 

2000-04-09 -1.772919 

Freq: D, Length: 100, dtype: float64 


In [211]: ts. resample( 'M' ) .meanQ 
0ut[211] : 

2000-01-31 -0.165893 

2000-02-29 0.078606 

2000-03-31 0.223811 

2000-04-30 -0.063643 

Freq: M, dtype: float64 


In [212]: ts.resample( 'M' , kind= 1 period ') ,mean() 
0ut[212] : 
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2000-01 -0.165893 

2000-02 0.078606 

2000-03 0.223811 

2000-04 -0.063643 

Freq: M, dtype: float64 

resample is a flexible and high-performance method that can be used to process very 
large time series. The examples in the following sections illustrate its semantics and 
use. Table 11-5 summarizes some of its options. 


Table 11-5. Resample method arguments 


Argument Description 


freq 

axis 

flll_method 

closed 

label 

loffset 

limit 

kind 

convention 


String or DateOffset indicating desired resampled frequency (e.g., 'M 1 , '5min', or Second (15)) 

Axis to resample on; default axis=0 

How to interpolate when upsampling, as in ' f fill ' or 1 bf ill '; by default does no interpolation 
In downsampling, which end of each interval is closed (inclusive), 1 right' or' left' 

In downsampling, how to label the aggregated result, with the ' right 1 or' left 1 bin edge (e.g., the 
9:30 to 9:35 five-minute interval could be labeled 9:30 or 9:35) 

Time adjustment to the bin labels, such as ' - Is 1 /Secondf -1) to shift the aggregate labels one 
second earlier 

When forward or backward filling, the maximum number of periods to fill 

Aggregate to periods (' period') or timestamps ( 1 timestamp 1 ); defaults to the type of index the 

time series has 

When resampling periods, the convention ('start' or ' end') for converting the low-frequency period 
to high frequency; defaults to ' end' 


Downsampling 

Aggregating data to a regular, lower frequency is a pretty normal time series task. The 
data you’re aggregating doesn’t need to be fixed frequently; the desired frequency 
defines bin edges that are used to slice the time series into pieces to aggregate. For 
example, to convert to monthly, 1 M' or ' BM', you need to chop up the data into one- 
month intervals. Each interval is said to be half-open; a data point can only belong to 
one interval, and the union of the intervals must make up the whole time frame. 
There are a couple things to think about when using resample to downsample data: 

• Which side of each interval is closed 

• How to label each aggregated bin, either with the start of the interval or the end 

To illustrate, let’s look at some one-minute data: 

In [213]: rng = pd.date_range( '2000-01-01' , periods=12, freq='T') 

In [214]: ts = pd.Serles(np.arange(12), index=rng) 
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In [215]: ts 
Out[215] : 

2000-01-01 00:00:00 0 

2000-01-01 00:01:00 1 

2000-01-01 00:02:00 2 

2000-01-01 00:03:00 3 

2000-01-01 00:04:00 4 

2000-01-01 00:05:00 5 

2000-01-01 00:06:00 6 

2000-01-01 00:07:00 7 

2000-01-01 00:08:00 8 

2000-01-01 00:09:00 9 

2000-01-01 00:10:00 10 

2000-01-01 00:11:00 11 

Freq: T, dtype: int64 

Suppose you wanted to aggregate this data into five-minute chunks or bars by taking 
the sum of each group: 

In [216]: ts.resample( '5min' , closed= 'right '). sum( ) 

0ut[216] : 

1999- 12-31 23:55:00 0 

2000- 01-01 00:00:00 15 

2000-01-01 00:05:00 40 

2000 - 01-01 00 : 10:00 11 

Freq: 5T, dtype: tnt64 

The frequency you pass defines bin edges in five-minute increments. By default, 
the left bin edge is inclusive, so the 00:00 value is included in the 00:00 to 00:05 
interval. 1 Passing closed=' right' changes the interval to be closed on the right: 

In [217]: ts.resample( '5min' , ctosed= 'right '). sum() 

0ut[217] : 

1999- 12-31 23:55:00 0 

2000- 01-01 00:00:00 15 

2000-01-01 00:05:00 40 

2000 - 01-01 00 : 10:00 11 

Freq: 5T, dtype: tnt64 

The resulting time series is labeled by the timestamps from the left side of each bin. 
By passing label=' right' you can label them with the right bin edge: 

In [218]: ts . resample( 1 5min' , ctosed= 'right' , label= 1 right '). sun( ) 

0ut[218] : 

2000 - 01-01 00 : 00:00 0 

2000-01-01 00:05:00 15 


1 The choice of the default values for closed and label might seem a bit odd to some users. In practice the 
choice is somewhat arbitrary; for some target frequencies, closed = 1 left 1 is preferable, while for others 
closed = 1 right' makes more sense. The important thing is that you keep in mind exactly how you are seg¬ 
menting the data. 
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2000-01-01 00:10:00 40 
2000-01-01 00:15:00 11 
Freq: 5T, dtype: lnt64 


See Figure 11-3 for an illustration of minute frequency data being resampled to five- 
minute frequency. 


closed='left' 

9:00 

9:01 

9:02 

9:03 

9:04 

9:05 








closed='right' 

9:00 

9:01 

9:02 

9:03 

9:04 

9:05 


t t 

label='left' label='right' 


Figure 11-3. Five-minute resampling illustration of closed, label conventions 


Lastly, you might want to shift the result index by some amount, say subtracting one 
second from the right edge to make it more clear which interval the timestamp refers 
to. To do this, pass a string or date offset to loff set: 


In [219] 


ts.resample( '5min' , closed= 'right' , 

labe!= 'right' , loffset= 1 -Is 1 ). sun( ) 


0ut[219] : 

1999- 12-31 23:59:59 0 

2000- 01-01 00:04:59 15 

2000-01-01 00:09:59 40 


2000-01-01 00:14:59 11 

Freq: 5T, dtype: int64 


You also could have accomplished the effect of loff set by calling the shift method 
on the result without the loff set. 


Open-High-Low-Close (OHLC) resampling 

In finance, a popular way to aggregate a time series is to compute four values for each 
bucket: the first (open), last (close), maximum (high), and minimal (low) values. By 
using the ohlc aggregate function you will obtain a DataFrame having columns con¬ 
taining these four aggregates, which are efficiently computed in a single sweep of the 
data: 

In [220]: ts.resample( '5min' ). ohtc( ) 

Out[220] : 

open high low close 
2000-01-01 00:00:00 040 4 

2000-01-01 00:05:00 595 9 

2000 - 01-01 00 : 10:00 10 11 10 11 
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Upsampling and Interpolation 

When converting from a low frequency to a higher frequency, no aggregation is 
needed. Let’s consider a DataFrame with some weekly data: 

In [221]: frame = pd.DataFrame(np.random.randn(2, 4), 

.: tndex=pd.date_range( '1/1/2000 1 , periods=2, 

.: freq='W-WED'), 

.: columns=[ 1 Colorado 1 , 'Texas', 'New York', 'Ohio']) 

In [222]: frame 
0ut[222] : 

Colorado Texas New York Ohio 
2000-01-05 -0.896431 0.677263 0.036503 0.087102 

2000-01-12 -0.046662 0.927238 0.482284 -0.867130 

When you are using an aggregation function with this data, there is only one value 
per group, and missing values result in the gaps. We use the asf req method to con¬ 
vert to the higher frequency without any aggregation: 


In [223]: 

df_daily = 

frame.resample( 'D' ). 

asfreq( ) 

In [224]: df_daily 
0ut[224]: 

Colorado 

Texas 

New York 

Ohio 

2000-01-05 -0.896431 

0.677263 

0.036503 

0.087102 

2000-01-06 

NaN 

NaN 

NaN 

NaN 

2000-01-07 

NaN 

NaN 

NaN 

NaN 

2000-01-08 

NaN 

NaN 

NaN 

NaN 

2000-01-09 

NaN 

NaN 

NaN 

NaN 

2000-01-10 

NaN 

NaN 

NaN 

NaN 

2000-01-11 

NaN 

NaN 

NaN 

NaN 

2000-01-12 -0.046662 

0.927238 

0.482284 

-0.867130 


Suppose you wanted to fill forward each weekly value on the non-Wednesdays. The 
same filling or interpolation methods available in the fillna and reindex methods 
are available for resampling: 


In [225]: frame.resample(' D' ) .ffill() 
0ut[225] : 



Colorado 

Texas 

New York 

Ohio 

2000-01-05 

-0.896431 

0.677263 

0.036503 

0.087102 

2000-01-06 

-0.896431 

0.677263 

0.036503 

0.087102 

2000-01-07 

-0.896431 

0.677263 

0.036503 

0.087102 

2000-01-08 

-0.896431 

0.677263 

0.036503 

0.087102 

2000-01-09 

-0.896431 

0.677263 

0.036503 

0.087102 

2000-01-10 

-0.896431 

0.677263 

0.036503 

0.087102 

2000-01-11 

-0.896431 

0.677263 

0.036503 

0.087102 

2000-01-12 

-0.046662 

0.927238 

0.482284 

-0.867130 


You can similarly choose to only fill a certain number of periods forward to limit how 
far to continue using an observed value: 
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In [226]: frame.resample( 1 D '). ffill(limit=2) 
0ut[226] : 



Colorado 

Texas 

New York 

Ohio 

2000-01-05 

-0.896431 

0.677263 

0.036503 

0.087102 

2000-01-06 

-0.896431 

0.677263 

0.036503 

0.087102 

2000-01-07 

-0.896431 

0.677263 

0.036503 

0.087102 

2000-01-08 

NaN 

NaN 

NaN 

NaN 

2000-01-09 

NaN 

NaN 

NaN 

NaN 

2000-01-10 

NaN 

NaN 

NaN 

NaN 

2000-01-11 

NaN 

NaN 

NaN 

NaN 

2000-01-12 

-0.046662 

0.927238 

0.482284 

-0.867130 


Notably, the new date index need not overlap with the old one at all: 

In [227]: frame. resample( 'W-THU '). ffillQ 
0ut[227] : 

Colorado Texas New York Ohio 

2000-01-06 -0.896431 0.677263 0.036503 0.087102 

2000-01-13 -0.046662 0.927238 0.482284 -0.867130 


Resampling with Periods 

Resampling data indexed by periods is similar to timestamps: 


In [228]: frame = pd,DataFrame(np.random.randn(24, 4), 

.: index=pd.period_range('1-2000', '12-2001', 

.: freq=' M 1 ), 

.: columns=[ 1 Colorado', 'Texas', 'New York', 'Ohio']) 


In [229]: frame[:5] 

0ut[229] : 

Colorado Texas 
2000-01 0.493841 -0.155434 

2000-02 -1.179442 0.443171 

2000-03 0.787358 0.248845 

2000-04 1.302395 -0.272154 

2000-05 -1.040816 0.426419 


New York Ohio 
1.397286 1.507055 
1.395676 -0.529658 
0.743239 1.267746 
-0.051532 -0.467740 
0.312945 -1.115689 


In [230]: annual_frame = frame.resample( 'A-DEC ') ,mean() 

In [231]: annual_frame 
0ut[231] : 

Colorado Texas New York Ohio 

2000 0.556703 0.016631 0.111873 -0.027445 

2001 0.046303 0.163344 0.251503 -0.157276 

Upsampling is more nuanced, as you must make a decision about which end of the 
timespan in the new frequency to place the values before resampling, just like the 
asf req method. The convention argument defaults to ' start ' but can also be ' end': 

# Q-DEC: Quarterly, year ending in December 
In [232]: annual_frame. resample( 'Q-DEC '). ffillQ 
0ut[232] : 
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Colorado Texas New York 


Ohio 


2000Q1 

0.556703 

2000Q2 

0.556703 

2000Q3 

0.556703 

2000Q4 

0.556703 

2001Q1 

0.046303 

2001Q2 

0.046303 

2001Q3 

0.046303 

2001Q4 

0.046303 


0.016631 

0.016631 

0.016631 

0.016631 

0.163344 

0.163344 

0.163344 

0.163344 


0.111873 

0.111873 

0.111873 

0.111873 

0.251503 

0.251503 

0.251503 

0.251503 


-0.027445 

-0.027445 

-0.027445 

-0.027445 

-0.157276 

-0.157276 

-0.157276 

-0.157276 


In [233]: annual_frame.resample( 'Q-DEC' , convention= 1 end 
0ut[233] : 

Colorado Texas New York Ohio 


2000Q4 0.556703 
2001Q1 0.556703 
2001Q2 0.556703 
2001Q3 0.556703 
2001Q4 0.046303 


0.016631 

0.016631 

0.016631 

0.016631 

0.163344 


0.111873 

0.111873 

0.111873 

0.111873 

0.251503 


-0.027445 

-0.027445 

-0.027445 

-0.027445 

-0.157276 


• ffilK) 


Since periods refer to timespans, the rules about upsampling and downsampling are 
more rigid: 


• In downsampling, the target frequency must be a subperiod of the source 
frequency. 

• In upsampling, the target frequency must be a superperiod of the source 
frequency. 


If these rules are not satisfied, an exception will be raised. This mainly affects the 
quarterly, annual, and weekly frequencies; for example, the timespans defined by Q- 
MAR only line up with A-MAR, A-JUN, A-SEP, and A-DEC: 


In [234]: annual_frame. resample( 'Q-MAR' ) .ffillQ 
0ut[234]: 

Colorado Texas New York Ohio 


2000Q4 0.556703 
2001Q1 0.556703 
2001Q2 0.556703 
2001Q3 0.556703 
2001Q4 0.046303 
2002Q1 0.046303 
2002Q2 0.046303 
2002Q3 0.046303 


0.016631 

0.016631 

0.016631 

0.016631 

0.163344 

0.163344 

0.163344 

0.163344 


0.111873 

0.111873 

0.111873 

0.111873 

0.251503 

0.251503 

0.251503 

0.251503 


-0.027445 

-0.027445 

-0.027445 

-0.027445 

-0.157276 

-0.157276 

-0.157276 

-0.157276 


11.7 Moving Window Functions 

An important class of array transformations used for time series operations are statis¬ 
tics and other functions evaluated over a sliding window or with exponentially decay¬ 
ing weights. This can be useful for smoothing noisy or gappy data. I call these moving 
window functions, even though it includes functions without a fixed-length window 
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like exponentially weighted moving average. Like other statistical functions, these 
also automatically exclude missing data. 

Before digging in, we can load up some time series data and resample it to business 
day frequency: 

In [235]: close_px_all = pd.read_csv( 'examples/stock_px_2.csv' , 

.: parse_dates=True, index_col=0) 

In [236]: close_px = close_px_all[ [ 'AAPL' , 'MSFT', 'XOM 1 ]] 

In [237]: close_px = close_px. resample( 1 B' ). ffillQ 

I now introduce the rolling operator, which behaves similarly to resample and 
groupby. It can be called on a Series or DataFrame along with a window (expressed as 
a number of periods; see Figure 11-4 for the plot created): 


In [238]: close_px.AAPL.plot() 

0ut[238]: <matplotlib.axes._subplots.AxesSubplot at 0x7f2f2570cf98> 
In [239]: close_px.AAPL. rolling(250) ,nean( ). plot( ) 



The expression rolling(250) is similar in behavior to groupby, but instead of group¬ 
ing it creates an object that enables grouping over a 250-day sliding window. So here 
we have the 250-day moving window average of Apples stock price. 
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By default rolling functions require all of the values in the window to be non-NA. 
This behavior can be changed to account for missing data and, in particular, the fact 
that you will have fewer than window periods of data at the beginning of the time 
series (see Figure 11-5): 

In [241]: appl_std250 = close_px.AAPL.rolling(250, nln_pertods=10). std() 

In [242]: appl_std250[5:12] 

Out [242] : 

2003-01-09 NaN 

2003-01-10 NaN 

2003-01-13 NaN 

2003-01-14 NaN 

2003-01-15 0.077496 

2003-01-16 0.074760 

2003-01-17 0.112368 

Freq: B, Name: AAPL, dtype: float64 


In [243]: appl_std250.plot() 



In order to compute an expanding window mean, use the expanding operator instead 
of rolling. The expanding mean starts the time window from the beginning of the 
time series and increases the size of the window until it encompasses the whole series. 
An expanding window mean on the apple_std250 time series looks like this: 

In [244]: expanding_mean = appl_std250.expanding() .mean() 
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Calling a moving window function on a DataFrame applies the transformation to 
each column (see Figure 11-6): 



Figure 11-6. Stocks prices 60 - day MA (log Y-axis) 


The rolling function also accepts a string indicating a fixed-size time offset rather 
than a set number of periods. Using this notation can be useful for irregular time ser¬ 
ies. These are the same strings that you can pass to resample. For example, we could 
compute a 20-day rolling mean like so: 


In [247]: close_px.rolltng( '20D' ) ,nean( ) 


0ut[247] : 

2003-01-02 

2003-01-03 

2003-01-06 

2003-01-07 

2003-01-08 

2003-01-09 

2003-01-10 

2003-01-13 

2003-01-14 

2003-01-15 

2011-10-03 

2011-10-04 

2011-10-05 


AAPL 

7.400000 

7.425000 

7.433333 

7.432500 

7.402000 

7.391667 

7.387143 

7.378750 

7.370000 

7.355000 

398.002143 

396.802143 

395.751429 


MSFT 

21.110000 

21.125000 

21.256667 

21.425000 

21.402000 

21.490000 

21.558571 

21.633750 

21.717778 

21.757000 

25.890714 

25.807857 

25.729286 


XOM 

29.220000 

29.230000 

29.473333 

29.342500 

29.240000 

29.273333 

29.238571 

29.197500 

29.194444 

29.152000 

72.413571 

72.427143 

72.422857 
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2011-10-06 

2011-10-07 

2011 - 10-10 

2011 - 10-11 

2011 - 10-12 


394.099286 

392.479333 

389.351429 

388.505000 

388.531429 

388.826429 


25.673571 

25.712000 

25.602143 

25.674286 

25.810000 

25.961429 

26.048667 


72.375714 

72.454667 

72.527857 

72.835000 

73.400714 

73.905000 

74.185333 


2011-10-13 
2011-10-14 391.038000 

[2292 rows x 3 columns] 


Exponentially Weighted Functions 

An alternative to using a static window size with equally weighted observations is to 
specify a constant decay factor to give more weight to more recent observations. 
There are a couple of ways to specify the decay factor. A popular one is using a span, 
which makes the result comparable to a simple moving window function with win¬ 
dow size equal to the span. 

Since an exponentially weighted statistic places more weight on more recent observa¬ 
tions, it “adapts” faster to changes compared with the equal-weighted version. 

pandas has the ewm operator to go along with rolling and expanding. Here’s an 
example comparing a 60-day moving average of Apple’s stock price with an EW mov¬ 
ing average with span=60 (see Figure 11-7): 

In [249]: aapl_px = close_px. AAPL[' 20062007' ] 

In [250]: ma60 = aapl_px.rolling(30, min_periods=20).mean() 

In [251]: ewma60 = aapl_px.ewm(span=30) ,mean() 

In [252]: ma60.plot(style=' k--' , label=' Simple MA') 

0ut[252]: <matplotlib.axes._subplots.AxesSubplot at 0x7f2f252161d0> 

In [253]: ewma60.plot(style= 1 k-' , label='EW MA') 

0ut[253]: <matplotlib.axes._subplots.AxesSubplot at 0x7f2f252161d0> 

In [254]: plt.legendQ 
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Figure 11-7. Simple moving average versus exponentially weighted 


Binary Moving Window Functions 

Some statistical operators, like correlation and covariance, need to operate on two 
time series. As an example, financial analysts are often interested in a stock’s correla¬ 
tion to a benchmark index like the S&P 500. To have a look at this, we first compute 
the percent change for all of our time series of interest: 

In [256]: spx_px = close_px_all[ 1 SPX 1 ] 

In [257]: spx_rets = spx_px.pct_change() 

In [258]: returns = close_px.pct_change() 

The corr aggregation function after we call rolling can then compute the rolling 
correlation with spx_rets (see Figure 11-8 for the resulting plot): 

In [259]: corr = returns.AAPL. rolltng(125, min_perlods=100) .corr(spx_rets) 

In [260]: corr.plotQ 
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Suppose you wanted to compute the correlation of the S&P 500 index with many 
stocks at once. Writing a loop and creating a new DataFrame would be easy but might 
get repetitive, so if you pass a Series and a DataFrame, a function like rolling_corr 
will compute the correlation of the Series (spx_rets, in this case) with each column 
in the DataFrame (see Figure 11-9 for the plot of the result): 

In [262]: com = returns.rotling(125, nin_perlods=100),corr(spx_rets) 

In [263]: corr.plotQ 
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User-Defined Moving Window Functions 

The apply method on rolling and related methods provides a means to apply an 
array function of your own devising over a moving window. The only requirement is 
that the function produce a single value (a reduction) from each piece of the array. 
For example, while we can compute sample quantiles using rolling(...) .quan 
tile(q), we might be interested in the percentile rank of a particular value over the 
sample. The scipy.stats.percentileofscore function does just this (see 
Figure 11-10 for the resulting plot): 

In [265]: from import percentileofscore 

In [266]: score_at_2percent = lambda x: percentileofscore(x, 0.02) 

In [267]: result = returns.AAPL.rolllng(250).apply(score_at_2percent) 

In [268]: result.plot( ) 
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Figure 11-10. Percentile rank of 2% AAPL return over one-year window 
If you don’t have SciPy installed already, you can install it with conda or pip. 


11.8 Conclusion 

Time series data calls for different types of analysis and data transformation tools 
than the other types of data we have explored in previous chapters. 

In the following chapters, we will move on to some advanced pandas methods and 
show how to start using modeling libraries like statsmodels and scikit-learn. 
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CHAPTER 12 


Advanced pandas 


The preceding chapters have focused on introducing different types of data wrangling 
workflows and features of NumPy, pandas, and other libraries. Over time, pandas has 
developed a depth of features for power users. This chapter digs into a few more 
advanced feature areas to help you deepen your expertise as a pandas user. 

12.1 Categorical Data 

This section introduces the pandas Categorical type. I will show how you can ach¬ 
ieve better performance and memory use in some pandas operations by using it. I 
also introduce some tools for using categorical data in statistics and machine learning 
applications. 

Background and Motivation 

Frequently, a column in a table may contain repeated instances of a smaller set of dis¬ 
tinct values. We have already seen functions like unique and value_counts, which 
enable us to extract the distinct values from an array and compute their frequencies, 
respectively: 

In [10]: import as ; import as 

In [11]: values = pd.Series( [' apple 1 , 'orange', 'apple', 

. . .. : 1 apple' ] * 2) 


In [12]: values 
0ut[12] : 

0 apple 

1 orange 

2 apple 

3 apple 
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4 apple 

5 orange 

6 apple 

7 apple 
dtype: object 

In [13]: pd.unlque(values) 

0ut[13]: array( [ 1 apple' , 'orange'], dtype=object) 

In [14]: pd.value_counts(values) 

0ut[14] : 
apple 6 
orange 2 
dtype: lnt64 

Many data systems (for data warehousing, statistical computing, or other uses) have 
developed specialized approaches for representing data with repeated values for more 
efficient storage and computation. In data warehousing, a best practice is to use so- 
called dimension tables containing the distinct values and storing the primary obser¬ 
vations as integer keys referencing the dimension table: 

In [15]: values = pd. Serles( [0, 1, 0, 0] * 2) 

In [16]: din = pd.Serles( [' apple' , 'orange']) 

In [17]: values 
0ut[17] : 

0 0 
1 1 

2 0 

3 0 

4 0 

5 1 

6 0 

7 0 

dtype: int64 

In [18]: din 
0ut[18] : 

0 apple 

1 orange 

dtype: object 

We can use the take method to restore the original Series of strings: 

In [19]: din.take(values) 

0ut[19] : 

0 apple 

1 orange 

0 apple 

0 apple 

0 apple 

1 orange 


364 | Chapter 12: Advanced pandas 



0 apple 

0 apple 

dtype: object 

This representation as integers is called the categorical or dictionary-encoded repre¬ 
sentation. The array of distinct values can be called the categories, dictionary, or levels 
of the data. In this book we will use the terms categorical and categories. The integer 
values that reference the categories are called the category codes or simply codes. 

The categorical representation can yield significant performance improvements when 
you are doing analytics. You can also perform transformations on the categories while 
leaving the codes unmodified. Some example transformations that can be made at 
relatively low cost are: 

• Renaming categories 

• Appending a new category without changing the order or position of the existing 
categories 


Categorical Type in pandas 

pandas has a special Categorical type for holding data that uses the integer-based 
categorical representation or encoding. Let’s consider the example Series from before: 

In [20]: fruits = ['apple', 'orange', 'apple', 'apple'] * 2 

In [21]: N = len(fruits) 

In [22]: df = pd.DataFrame({ 'fruit' : fruits, 

....: 'basket_id': np.arange(N) , 

....: 'count': np.random.randint(3, 15, size=N), 

....: 'weight': np.random.uniform(0, 4, size=N)}, 

....: columns=[ 'basket_id' , 'fruit', 'count', 'weight']) 


In [23]: 
0ut[23] : 

df 




basket 

_id 

fruit 

count 

weight 

0 

0 

apple 

5 

3.858058 

1 

1 

orange 

8 

2.612708 

2 

2 

apple 

4 

2.995627 

3 

3 

apple 

7 

2.614279 

4 

4 

apple 

12 

2.990859 

5 

5 

orange 

8 

3.845227 

6 

6 

apple 

5 

0.033553 

7 

7 

apple 

4 

0.425778 


Here, df [' fruit' ] is an array of Python string objects. We can convert it to categori¬ 
cal by calling: 


12.1 Categorical Data | 365 










In [24]: fruit_cat = df [' fruit '] .astype(' category 1 ) 

In [25]: fruit_cat 
0ut[25] : 

0 apple 

1 orange 

2 apple 

3 apple 

4 apple 

5 orange 

6 apple 

7 apple 

Name: fruit, dtype: category 
Categories (2, object): [apple, orange] 

The values for fruit_cat are not a NumPy array, but an instance of pandas.Catego 
rical: 

In [26]: c = fruit_cat.values 
In [27]: type(c) 

0ut[27] : pandas.core.categorical.Categorical 

The Categorical object has categories and codes attributes: 

In [28]: c.categories 

0ut[28]: Index( [ 1 apple 1 , ’orange'], dtype= 'object 1 ) 

In [29]: c.codes 

0ut[29]: array([0, 1, 0, 0, 0, 1, 0, 0], dtype=int8) 

You can convert a DataFrame column to categorical by assigning the converted result: 

In [30]: df [' fruit 1 ] = df [' fruit 1 ]. astype( 'category' ) 

In [31]: df.fruit 
0ut[31] : 

0 apple 

1 orange 

2 apple 

3 apple 

4 apple 

5 orange 

6 apple 

7 apple 

Name: fruit, dtype: category 
Categories (2, object): [apple, orange] 

You can also create pandas.Categorical directly from other types of Python 
sequences: 

In [32]: my_categories = pd.Categorical( [ 'foo 1 , 'bar', 'baz', 'foo', 'bar']) 

In [33]: my_categories 
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Out [ 33 ]: 

[foo, bar, baz, foo, bar] 

Categories (3, object): [bar, baz, foo] 

If you have obtained categorical encoded data from another source, you can use the 
alternative from_codes constructor: 

In [34]: categories = ['foo', 'bar', 'baz'] 

In [35]: codes = [0, 1, 2, 0, 0, 1] 

In [36]: my_cats_2 = pd.Categorical.from_codes(codes, categories) 

In [37]: my_cats_2 
0ut[37] : 

[foo, bar, baz, foo, foo, bar] 

Categories (3, object): [foo, bar, baz] 

Unless explicitly specified, categorical conversions assume no specific ordering of the 
categories. So the categories array may be in a different order depending on the 
ordering of the input data. When using from_codes or any of the other constructors, 
you can indicate that the categories have a meaningful ordering: 

In [38]: ordered_cat = pd.Categorical.from_codes(codes, categories, 

....: ordered=T rue) 


In [39]: ordered_cat 
0ut[39] : 

[foo, bar, baz, foo, foo, bar] 

Categories (3, object): [foo < bar < baz] 

The output [foo < bar < baz] indicates that 'foo' precedes 'bar' in the ordering, 
and so on. An unordered categorical instance can be made ordered with as_ordered: 

In [40]: my_cats_2.as_ordered() 

Out[40] : 

[foo, bar, baz, foo, foo, bar] 

Categories (3, object): [foo < bar < baz] 

As a last note, categorical data need not be strings, even though I have only showed 
string examples. A categorical array can consist of any immutable value types. 

Computations with Categoricals 

Using Categorical in pandas compared with the non-encoded version (like an array 
of strings) generally behaves the same way. Some parts of pandas, like the groupby 
function, perform better when working with categoricals. There are also some func¬ 
tions that can utilize the ordered flag. 
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Let’s consider some random numeric data, and use the pandas.qcut binning func¬ 
tion. This return pandas.Categorical; we used pandas.cut earlier in the book but 
glossed over the details of how categoricals work: 

In [41]: np.random. seed(12345) 

In [42]: draws = np.random. randn(1000) 

In [43]: draws[:5] 

0ut[43] : array([-0.2047, 0.4789, -0.5194, -0.5557, 1.9658]) 

Let’s compute a quartile binning of this data and extract some statistics: 

In [44]: bins = pd.qcut(draws, 4) 

In [45]: bins 
0ut[45] : 

[(-0.684, -0.0101], (-0.0101, 0.63], (-0.684, -0.0101], (-0.684, -0.0101], (0.63, 
3.928], ..., (-0.0101, 0.63], (-0.684, -0.0101], (-2.95, -0.684], (-0.0101, 0.63 
], (0.63, 3.928]] 

Length: 1000 

Categories (4, interval[float64] ): [(-2.95, -0.684] < (-0.684, -0.0101] < (-0.010 
1, 0.63] < 

(0.63, 3.928]] 

While useful, the exact sample quartiles may be less useful for producing a report 
than quartile names. We can achieve this with the labels argument to qcut: 

In [46]: bins = pd.qcut(draws, 4, labels=[ 'Q1 1 , 'Q2' , ' Q3' , 'Q4' ]) 

In [47]: bins 
0ut[47] : 

[Q2, Q3, Q2 , Q2 , Q4, ..., Q3, Q2, Ql, Q3, Q4] 

Length: 1000 

Categories (4, object): [Ql < Q2 < Q3 < Q4] 

In [48]: bins.codes[ : 10] 

0ut[48]: array([l, 2, 1, 1, 3, 3, 2, 2, 3, 3], dtype=int8) 

The labeled bins categorical does not contain information about the bin edges in the 
data, so we can use groupby to extract some summary statistics: 

In [49]: bins = pd.Series(bins, name= 'quartile 1 ) 

In [50]: results = (pd.Series(draws) 

....: .groupby(bins) 

....: . agg( [' count' , 'min', 'max']) 

....: .reset_index()) 

In [51]: results 
0ut[51] : 

quartile count min max 

0 Ql 250 -2.949343 -0.685484 
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1 

2 

3 


Q2 250 -0.683066 -0.010115 
Q3 250 -0.010032 0.628894 
Q4 250 0.634238 3.927528 


The 'quartile' column in the result retains the original categorical information, 
including ordering, from bins: 

In [52]: resultsf 'quartile'] 

0ut[52] : 

0 Q1 

1 Q2 

2 Q3 

3 Q4 

Name: quartile, dtype: category 
Categories (4, object): [Q1 < Q2 < Q3 < Q4] 

Better performance with categoricals 

If you do a lot of analytics on a particular dataset, converting to categorical can yield 
substantial overall performance gains. A categorical version of a DataFrame column 
will often use significantly less memory, too. Let’s consider some Series with 10 mil¬ 
lion elements and a small number of distinct categories: 

In [53]: N = 10000000 

In [54]: draws = pd.Series(np.random. randn(N)) 

In [55]: labels = pd.Series( [ 'foo' , 'bar', 'baz', 'qux'] * (N // 4)) 

Now we convert labels to categorical: 

In [56]: categories = labels.astype( 'category' ) 

Now we note that labels uses significantly more memory than categories: 

In [57]: labels.memory_usage( ) 

0ut[57] : 80000080 

In [58]: categories.memory_usage() 

0ut[58] : 10000272 

The conversion to category is not free, of course, but it is a one-time cost: 

In [59]: %tine _ = labels.astype(' category' ) 

CPU tines: user 490 ns, sys: 240 ns, total: 730 ns 
Wall tine: 726 ns 

GroupBy operations can be significantly faster with categoricals because the underly¬ 
ing algorithms use the integer-based codes array instead of an array of strings. 


12.1 Categorical Data | 369 




Categorical Methods 

Series containing categorical data have several special methods similar to the Ser 
ies. str specialized string methods. This also provides convenient access to the cate¬ 
gories and codes. Consider the Series: 

In [60]: s = pd.Series( [' a' , 'b', 'c', ' d' ] * 2) 

In [61]: cat_s = s.astype(' category 1 ) 

In [62]: cat_s 
0ut[62] : 

0 a 

1 b 

2 c 

3 d 

4 a 

5 b 

6 c 

7 d 

dtype: category 

Categories (4, object): [a, b, c, d] 

The special attribute cat provides access to categorical methods: 

In [63]: cat_s.cat.codes 
0ut[63] : 

0 0 
1 1 
2 2 

3 3 

4 0 

5 1 

6 2 

7 3 

dtype: int8 

In [64]: cat_s.cat.categories 

0ut[64]: Index(['a', 'b', ' c' , ' d' ], dtype=' object' ) 

Suppose that we know the actual set of categories for this data extends beyond the 
four values observed in the data. We can use the set_categories method to change 
them: 

In [65]: actual_categories = ['a 1 , 'b' , 'c', ' d' , 'e 1 ] 

In [66]: cat_s2 = cat_s.cat.set_categories(actuai_categories) 

In [67]: cat_s2 
0ut[67] : 

0 a 
1 b 
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2 c 

3 d 

4 a 

5 b 

6 c 

7 d 

dtype: category 

Categories (5, object): [a, b, c, d, e] 

While it appears that the data is unchanged, the new categories will be reflected in 
operations that use them. For example, value_counts respects the categories, if 
present: 

In [68]: cat_s.va!ue_counts() 

0ut[68] : 
d 2 
c 2 
b 2 
a 2 

dtype: tnt64 

In [69]: cat_s2.value_counts() 

0ut[69] : 
d 2 
c 2 
b 2 
a 2 
e 0 

dtype: tnt64 

In large datasets, categoricals are often used as a convenient tool for memory savings 
and better performance. After you filter a large DataFrame or Series, many of the 
categories may not appear in the data. To help with this, we can use the 
remove_unused_categories method to trim unobserved categories: 

In [70]: cat_s3 = cat_s[cat_s.lsln( [ 'a' , ' b' ])] 

In [71]: cat_s3 
0ut[71] : 

0 a 
1 b 

4 a 

5 b 

dtype: category 

Categories (4, object): [a, b, c, d] 

In [72]: cat_s3.cat.remove_unused_categorles( ) 

0ut[72] : 

0 a 
1 b 

4 a 

5 b 
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dtype: category 

Categories (2, object): [a, b] 

See Table 12-1 for a listing of available categorical methods. 


Table 12-1. Categorical methods for Series in pandas 


1 Method 

Description J 

add_categori.es 

Append new (unused) categories at end of existing categories 

as_ordered 

Make categories ordered 

as_unordered 

Make categories unordered 

remove_categori.es 

Remove categories, setting any removed values to null 

remove_unused_categories 

Remove any category values which do not appear in the data 

rename_categori.es 

Replace categories with indicated set of new category names; cannot change the 
number of categories 

reorder_categori.es 

Behaves like rename_categori.es, but can also change the result to have ordered 
categories 

set_categori.es 

Replace the categories with the indicated set of new categories; can add or remove 
categories 


Creating dummy variables for modeling 

When you’re using statistics or machine learning tools, you’ll often transform catego¬ 
rical data into dummy variables, also known as one-hot encoding. This involves creat¬ 
ing a DataFrame with a column for each distinct category; these columns contain Is 
for occurrences of a given category and 0 otherwise. 

Consider the previous example: 

In [73]: cat_s = pd.Series( [ 'a 1 , ' b' , 'c' , 'd' ] * 2, dtype= 1 category' ) 

As mentioned previously in Chapter 7, the pandas .get_dummies function converts 
this one-dimensional categorical data into a DataFrame containing the dummy 
variable: 

In [74]: pd.get_dummies(cat_s) 

0ut[74] : 

abed 
0 10 0 0 
10 10 0 
2 0 0 1 0 

3 0 0 0 1 

4 10 0 0 

5 0 10 0 

6 0 0 1 0 

7 0 0 0 1 
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12.2 Advanced GroupBy Use 

While we’ve already discussed using the groupby method for Series and DataFrame in 
depth in Chapter 10, there are some additional techniques that you may find of use. 

Group Transforms and "Unwrapped" GroupBys 

In Chapter 10 we looked at the apply method in grouped operations for performing 
transformations. There is another built-in method called transform, which is similar 
to apply but imposes more constraints on the kind of function you can use: 

• It can produce a scalar value to be broadcast to the shape of the group 

• It can produce an object of the same shape as the input group 

• It must not mutate its input 

Let’s consider a simple example for illustration: 

In [75]: df = pd.DataFrame({ 1 key' : ['a', ' b' , 'c'] * 4, 

....: 'value': np.arange(12. )}) 


In [76]: df 
0ut[76] : 

key value 
0 a 0.0 

1 b 1.0 

2 c 2.0 

3 a 3.0 

4 b 4.0 

5 c 5.0 

6 a 6.0 

7 b 7.0 

8 c 8.0 

9 a 9.0 

10 b 10.0 

11 c 11.0 

Here are the group means by key: 

In [77]: g = df.groupby( 'key ') .value 

In [78]: g.mean() 

0ut[78] : 
key 

a 4.5 

b 5.5 

c 6.5 

Name: value, dtype: float64 
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Suppose instead we wanted to produce a Series of the same shape as df [ ' value 1 ] but 
with values replaced by the average grouped by 'key'. We can pass the function 
lambda x: x.meanQ to transform: 

In [79]: g . transform(lambda x: x.meanQ) 

0ut[79] : 

0 4.5 

1 5.5 

2 6.5 

3 4.5 

4 5.5 

5 6.5 

6 4.5 

7 5.5 

8 6.5 

9 4.5 

10 5.5 

11 6.5 

Name: value, dtype: float64 

For built-in aggregation functions, we can pass a string alias as with the GroupBy agg 
method: 

In [80]: g.transform( 1 mean' ) 

Out[80] : 

0 4.5 

1 5.5 

2 6.5 

3 4.5 

4 5.5 

5 6.5 

6 4.5 

7 5.5 

8 6.5 

9 4.5 

10 5.5 

11 6.5 

Name: value, dtype: float64 

Like apply, transform works with functions that return Series, but the result must be 
the same size as the input. For example, we can multiply each group by 2 using a 
lambda function: 

In [81]: g. transform(lambda x: x * 2) 

0ut[81] : 

0 0.0 

1 2.0 

2 4.0 

3 6.0 

4 8.0 

5 10.0 

6 12.0 
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7 14.0 

8 16.0 

9 18.0 

10 20.0 

11 22.0 

Name: value, dtype: float64 

As a more complicated example, we can compute the ranks in descending order for 
each group: 

In [82]: g. transfom(lanbda x: x.rank(ascendlng=False)) 

0ut[82] : 

0 4.0 

1 4.0 

2 4.0 

3 3.0 

4 3.0 

5 3.0 

6 2.0 

7 2.0 

8 2.0 

9 1.0 

10 1.0 

11 1.0 

Name: value, dtype: float64 

Consider a group transformation function composed from simple aggregations: 

def normallze(x) : 

return (x - x.mean()) / x.stdQ 

We can obtain equivalent results in this case either using transform or apply: 

In [84]: g.transfom(nornallze) 

0ut[84] : 

0 -1.161895 

1 -1.161895 

2 -1.161895 

3 -0.387298 

4 -0.387298 

5 -0.387298 

6 0.387298 

7 0.387298 

8 0.387298 

9 1.161895 

10 1.161895 

11 1.161895 

Name: value, dtype: float64 

In [85]: g.apply(nornalize) 

0ut[85] : 

0 -1.161895 

1 -1.161895 

2 -1.161895 
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3 -0.387298 

4 -0.387298 

5 -0.387298 

6 0.387298 

7 0.387298 

8 0.387298 

9 1.161895 

10 1.161895 

11 1.161895 

Name: value, dtype: float64 

Built-in aggregate functions like ' mean' or ' sum' are often much faster than a general 
apply function. These also have a “fast past” when used with transform. This allows 
us to perform a so-called unwrapped group operation: 

In [86]: g .transform( 'mean' ) 

0ut[86] : 

0 4.5 

1 5.5 

2 6.5 

3 4.5 

4 5.5 

5 6.5 

6 4.5 

7 5.5 

8 6.5 

9 4.5 

10 5.5 

11 6.5 

Name: value, dtype: float64 


In [87]: normalized = (df['value'] - g.transform( 'mean' )) / g.transform( 'std' ) 


In [88]: normalized 
0ut[88] : 

0 -1.161895 

1 -1.161895 

2 -1.161895 

3 -0.387298 

4 -0.387298 

5 -0.387298 

6 0.387298 

7 0.387298 

8 0.387298 

9 1.161895 

10 1.161895 

11 1.161895 

Name: value, dtype: float64 


While an unwrapped group operation may involve multiple group aggregations, the 
overall benefit of vectorized operations often outweighs this. 
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Grouped Time Resampling 

For time series data, the resample method is semantically a group operation based on 
a time intervalization. Here’s a small example table: 

In [89]: N = 15 


In [96]: times = pd.date_range( '2017-05-20 00:00', freq='lmin', periods=N) 


In [91]: df = pd.DataFrame({ 'time' : times, 

’value 1 : np.arange(N)}) 


In [92]: df 
0ut[92] : 

0 2017-05-20 

1 2017-05-20 

2 2017-05-20 

3 2017-05-20 

4 2017-05-20 

5 2017-05-20 

6 2017-05-20 

7 2017-05-20 

8 2017-05-20 

9 2017-05-20 

10 2017-05-20 

11 2017-05-20 

12 2017-05-20 

13 2017-05-20 

14 2017-05-20 



time 

value 

00 

00 

00 

0 

00 

01 

00 

1 

00 

02 

00 

2 

00 

03 

00 

3 

00 

04 

00 

4 

00 

05 

00 

5 

00 

06 

00 

6 

00 

07 

00 

7 

00 

08 

00 

8 

00 

09 

00 

9 

00 

10 

00 

10 

00 

11 

00 

11 

00 

12 

00 

12 

00 

13 

00 

13 

00 

14 

00 

14 


Here, we can index by 1 time 1 and then resample: 


In [93]: df.set_index( 'time' ). resample( 1 5min 
0ut[93] : 

value 


time 


2017-05-20 00:00:00 5 
2017-05-20 00:05:00 5 
2017-05-20 00:10:00 5 


count( ) 


Suppose that a DataFrame contains multiple time series, marked by an additional 
group key column: 

In [94]: df2 = pd .DataFrame({ 'time' : times. repeat(3) , 

'key': np.tiie( [ 1 a' , 'b', 'c' ], N), 

'value': np.arange(N * 3.)}) 


In [95]: df2[:7] 

0ut[95] : 

key time value 

0 a 2017-05-20 00:00:00 0.0 

1 b 2017-05-20 00:00:00 1.0 
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2 c 2017-05-20 00:00:00 2.0 

3 a 2017-05-20 00:01:00 3.0 

4 b 2017-05-20 00:01:00 4.0 

5 c 2017-05-20 00:01:00 5.0 

6 a 2017-05-20 00:02:00 6.0 

To do the same resampling for each value of 'key', we introduce the pandas.Time 
Grouper object: 

In [96]: time_key = pd.TimeGrouper( 1 5mln' ) 

We can then set the time index, group by 1 key 1 and time_key, and aggregate: 

In [97]: resampled = (df2.set_lndex( 'time' ) 

....: . groupby( [ 1 key' , time_key]) 

-: •sum()) 


In [98]: resampled 
0ut[98] : 





value 

key time 



a 

2017-05-20 

00:00:00 

30.0 


2017-05-20 

00:05:00 

105.0 


2017-05-20 

00:10:00 

180.0 

b 

2017-05-20 

00:00:00 

35.0 


2017-05-20 

00:05:00 

110.0 


2017-05-20 

00:10:00 

185.0 

c 

2017-05-20 

00:00:00 

40.0 


2017-05-20 

00:05:00 

115.0 


2017-05-20 

00:10:00 

190.0 


In [99]: resampled.reset_lndex( ) 
0ut[99] : 



key 


time 

value 

0 

a 

2017-05-20 

00:00:00 

30.0 

1 

a 

2017-05-20 

00:05:00 

105.0 

2 

a 

2017-05-20 

00:10:00 

180.0 

3 

b 

2017-05-20 

00:00:00 

35.0 

4 

b 

2017-05-20 

00:05:00 

110.0 

5 

b 

2017-05-20 

00:10:00 

185.0 

6 

c 

2017-05-20 

00:00:00 

40.0 

7 

c 

2017-05-20 

00:05:00 

115.0 

8 

c 

2017-05-20 

00:10:00 

190.0 


One constraint with using TimeGrouper is that the time must be the index of the Ser¬ 
ies or DataFrame. 


12.3 Techniques for Method Chaining 

When applying a sequence of transformations to a dataset, you may find yourself cre¬ 
ating numerous temporary variables that are never used in your analysis. Consider 
this example, for instance: 


378 | Chapter 12: Advanced pandas 






df = load_data() 

df2 = df [df [' col2' ] < 0] 

df2[ 1 coll_defieaned' ] = df2 [' coll' ] - df2[ 'coll 1 ] .nean() 
result = df2,groupby( 'key' ) .coll_demeaned.std( ) 

While were not using any real data here, this example highlights some new methods. 
First, the DataFrame.assign method is a functional alternative to column assign¬ 
ments of the form df [k] = v. Rather than modifying the object in-place, it returns a 
new DataFrame with the indicated modifications. So these statements are equivalent: 

# Usual non-functional way 
df2 = df.copy() 

df2[' k' ] = v 

# Functional assign way 
df2 = df.asslgn(k=v) 

Assigning in-place may execute faster than using assign, but assign enables easier 
method chaining: 

result = (df2.asstgn(coll_demeaned=df2.coll - df2.col2.mean()) 

.groupby( 'key' ) 

.coll_deneaned.std()) 

I used the outer parentheses to make it more convenient to add line breaks. 

One thing to keep in mind when doing method chaining is that you may need to 
refer to temporary objects. In the preceding example, we cannot refer to the result of 
load_data until it has been assigned to the temporary variable df. To help with this, 
assign and many other pandas functions accept function-like arguments, also known 
as callables. 

To show callables in action, consider a fragment of the example from before: 

df = load_data() 

df2 = df [df[' col2' ] < 0] 

This can be rewritten as: 

df = (load_data() 

[lanbda x: x['col2'] < 0]) 

Here, the result of load_data is not assigned to a variable, so the function passed into 
[ ] is then bound to the object at that stage of the method chain. 

We can continue, then, and write the entire sequence as a single chained expression: 

result = (load_data() 

[lanbda x: x.col2 < 0] 

.asslgn(coll_deneaned=lanbda x: x.coll - x.coll.nean()) 

.groupby( 'key' ) 

.coll_deneaned.std()) 
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Whether you prefer to write code in this style is a matter of taste, and splitting up the 
expression into multiple steps may make your code more readable. 

The pipe Method 

You can accomplish a lot with built-in pandas functions and the approaches to 
method chaining with callables that we just looked at. However, sometimes you need 
to use your own functions or functions from third-party libraries. This is where the 
pipe method comes in. 

Consider a sequence of function calls: 

a = f(df, argl=vl) 
b = g(a, v2, arg3=v3) 
c = h(b, arg4=v4) 

When using functions that accept and return Series or DataFrame objects, you can 
rewrite this using calls to pipe: 

result = (df.pipe(f, argl=vl) 

.pipe(g, v2, arg3=v3) 

.pipe(h, arg4=v4)) 

The statement f (df) and df. pipe(f) are equivalent, but pipe makes chained invoca¬ 
tion easier. 

A potentially useful pattern for pipe is to generalize sequences of operations into 
reusable functions. As an example, let’s consider substracting group means from a 
column: 

g = df.groupby( [ 1 keyl 1 , 'key2' ]) 
df['coll'] = df [ 1 coll' ] - g. transfom( 1 mean 1 ) 

Suppose that you wanted to be able to demean more than one column and easily 
change the group keys. Additionally, you might want to perform this transformation 
in a method chain. Here is an example implementation: 

def group_demean(df , by, cols): 
result = df.copyO 
g = df.groupby(by) 
for c in cols: 

result[c] = df[c] - g[c] .transform^ 'mean' ) 
return result 

Then it is possible to write: 

result = (df[df.coll < 0] 

.pipe(group_demean, ['keyl 1 , 'key2'], ['coll'])) 
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12.4 Conclusion 

pandas, like many open source software projects, is still changing and acquiring new 
and improved functionality. As elsewhere in this book, the focus here has been on the 
most stable functionality that is less likely to change over the next several years. 

To deepen your expertise as a pandas user, I encourage you to explore the documen¬ 
tation and read the release notes as the development team makes new open source 
releases. We also invite you to join in on pandas development: fixing bugs, building 
new features, and improving the documentation. 
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CHAPTER 13 


Introduction to Modeling Libraries 

in Python 


In this book, I have focused on providing a programming foundation for doing data 
analysis in Python. Since data analysts and scientists often report spending a dispro¬ 
portionate amount of time with data wrangling and preparation, the book’s structure 
reflects the importance of mastering these techniques. 

Which library you use for developing models will depend on the application. Many 
statistical problems can be solved by simpler techniques like ordinary least squares 
regression, while other problems may call for more advanced machine learning 
methods. Fortunately, Python has become one of the languages of choice for imple¬ 
menting analytical methods, so there are many tools you can explore after completing 
this book 

In this chapter, I will review some features of pandas that may be helpful when you’re 
crossing back and forth between data wrangling with pandas and model fitting and 
scoring. I will then give short introductions to two popular modeling toolkits, stats- 
models and scikit-learn. Since each of these projects is large enough to warrant its 
own dedicated book, I make no effort to be comprehensive and instead direct you to 
both projects’ online documentation along with some other Python-based books on 
data science, statistics, and machine learning. 

13.1 Interfacing Between pandas and Model Code 

A common workflow for model development is to use pandas for data loading and 
cleaning before switching over to a modeling library to build the model itself. An 
important part of the model development process is called feature engineering in 
machine learning. This can describe any data transformation or analytics that extract 
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information from a raw dataset that may be useful in a modeling context. The data 
aggregation and GroupBy tools we have explored in this book are used often in a fea¬ 
ture engineering context. 

While details of “good” feature engineering are out of scope for this book, I will show 
some methods to make switching between data manipulation with pandas and mod¬ 
eling as painless as possible. 

The point of contact between pandas and other analysis libraries is usually NumPy 
arrays. To turn a DataFrame into a NumPy array, use the .values property: 

In [10]: import as 

In [11]: import as 

In [12]: data = pd.DataFrame({ 

....: 'x0' : [1, 2, 3, 4, 5], 

_: ' xl' : [0.01, -0.01, 0.25, -4.1, 0], 

....: V: [-1.5, 0., 3.6, 1.3, -2.]}) 


In [13]: data 
0ut[13] : 

x0 xl y 
0 1 0.01 -1.5 

1 2 - 0.01 0.0 

2 3 0.25 3.6 

3 4 -4.10 1.3 

4 5 0.00 -2.0 

In [14]: data.columns 

0ut[14]: Index(['x0', 'xl', 1 y' ], dtype=' object' ) 

In [15]: data.values 
0ut[15] : 

array([[ 1. , 0.01, -1.5 ], 

[ 2 . , - 0 . 01 , 0 . ], 

[ 3. , 0.25, 3.6 ], 

[ 4. , -4.1 , 1.3 ], 

[5. ,0. , -2. ]]) 

To convert back to a DataFrame, as you may recall from earlier chapters, you can pass 
a two-dimensional ndarray with optional column names: 

In [16]: df2 = pd.DataFrame(data.values, columns=[ 'one' , 'two', 'three']) 


In [17]: df2 
0ut[17] : 



one 

two 

three 

0 

1.0 

0.01 

-1.5 

1 

2.0 

-0.01 

0.0 

2 

3.0 

0.25 

3.6 
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3 4.0 -4.10 1.3 

4 5.0 0.00 -2.0 


i 


The .values attribute is intended to be used when your data is 
homogeneous—for example, all numeric types. If you have hetero¬ 
geneous data, the result will be an ndarray of Python objects: 


In [18]: df3 = data.copyO 


In [19]: df3[ 1 strings 1 ] = ['a 1 , ' b' , 'c', 'd' , 'e'] 


In [20]: df3 
Out[20] : 


x0 xl y strings 
0 1 0.01 -1.5 a 

1 2 -0.01 0.0 b 

2 3 0.25 3.6 c 

3 4 -4.10 1.3 d 

4 5 0.00 -2.0 e 


In [21]: df3.values 
0ut[21] : 

array([[l, 0.01, -1.5, 'a'], 

[2, -0.01, 0.0, 'b'], 

[3, 0.25, 3.6, 'c'], 

[4, -4.1, 1.3, ’d' ], 

[5, 0.0, -2.0, 'e' ]], dtype=object) 


For some models, you may only wish to use a subset of the columns. I recommend 
using loc indexing with values: 

In [22]: model_cols = [ 'x0' , ’xl 1 ] 

In [23]: data.loc[:, model_cols].values 
0ut[23] : 

array([[ 1. , 0.01], 

[ 2 - , - 0 . 01 ], 

[ 3. , 0.25], 

[ 4. , -4.1 ], 

[ 5. , 0. ]]) 

Some libraries have native support for pandas and do some of this work for you auto¬ 
matically: converting to NumPy from DataFrame and attaching model parameter 
names to the columns of output tables or Series. In other cases, you will have to per¬ 
form this “metadata management” manually. 

In Chapter 12 we looked at pandas’s Categorical type and the pandas.get_dumies 
function. Suppose we had a non-numeric column in our example dataset: 
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In [24]: data[ 'category' ] = pd .Categorlcal( [' a' , 'b' , 'a', 'a', 'b'], 

categories=[' a' , ' b' ]) 


In [25]: data 
0ut[25] : 

x0 xl y category 


0 1 0.01 -1.5 a 

1 2 -0.01 0.0 b 

2 3 0.25 3.6 a 

3 4 -4.10 1.3 a 

4 5 0.00 -2.0 b 


If we wanted to replace the 'category' column with dummy variables, we create 
dummy variables, drop the 'category' column, and then join the result: 

In [26]: dummies = pd.get_dummies(data.category, preflx= 'category' ) 

In [27]: data_with_dummies = data.drop( 'category' , axis=l). join(dummies) 


In [28]: data_wlth_dummies 
0ut[28] : 

x0 xl y category_a 


0 1 0.01 -1.5 1 

1 2 - 0.01 0.0 0 

2 3 0.25 3.6 1 

3 4 -4.10 1.3 1 

4 5 0.00 -2.0 0 


category_b 

0 

1 

0 

0 

1 


There are some nuances to fitting certain statistical models with dummy variables. It 
may be simpler and less error-prone to use Patsy (the subject of the next section) 
when you have more than simple numeric columns. 


13.2 Creating Model Descriptions with Patsy 

Patsy is a Python library for describing statistical models (especially linear models) 
with a small string-based “formula syntax,” which is inspired by (but not exactly the 
same as) the formula syntax used by the R and S statistical programming languages. 

Patsy is well supported for specifying linear models in statsmodels, so I will focus on 
some of the main features to help you get up and running. Patsy’s formulas are a spe¬ 
cial string syntax that looks like: 

y ~ X0 + xl 

The syntax a + b does not mean to add a to b, but rather that these are terms in the 
design matrix created for the model. The patsy .dmatrices function takes a formula 
string along with a dataset (which can be a DataFrame or a diet of arrays) and pro¬ 
duces design matrices for a linear model: 

In [29]: data = pd.DataFrame({ 

....: 'x0': [1, 2, 3, 4, 5], 
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xl' : [0.01, -0.01, 0.25, -4.1, 0], 
y' : [-1.5, 0., 3.6, 1.3, -2.]}) 


In [30]: data 
Out[30] : 

x0 xl y 
0 1 0.01 -1.5 

1 2 - 0.01 0.0 

2 3 0.25 3.6 

3 4 -4.10 1.3 

4 5 0.00 -2.0 

In [31]: import 

In [32]: y, X = patsy.dmatrices( 'y - x0 + xl', data) 

Now we have: 

In [33]: y 
0ut[33] : 

DesignMatrix with shape (5, 1) 

y 

-1.5 
0.0 
3.6 
1.3 
- 2.0 
Terms : 

'y' (column 0) 

In [34]: X 
0ut[34] : 

DesignMatrix with shape (5, 3) 

Intercept x0 xl 
1 1 0.01 
1 2 -0.01 
1 3 0.25 

1 4 -4.10 

1 5 0.00 

Terms : 

'Intercept' (column 0) 

'x0' (column 1) 

'xl' (column 2) 

These Patsy DesignMatrix instances are NumPy ndarrays with additional metadata: 

In [35]: np.asarray(y) 

0ut[35] : 
array( [ [-1.5], 

[ 0 - ], 

[ 3.6], 

[ 1.3], 

[- 2 . ]]) 
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In [36]: np.asarray(X) 
Out[36] : 


array( [[ 1. 

, 1. 

, 0-01], 

[ 1. 

, 2. 

, -0.01], 

[ 1. 

, 3. 

, 0.25], 

[ 1. 

, 4. 

, -4.1 ], 

[ 1. 

, 5. 

. 0- ID 


You might wonder where the Intercept term came from. This is a convention for 
linear models like ordinary least squares (OLS) regression. You can suppress the 
intercept by adding the term + 0 to the model: 

In [37]: patsy.dnatrtces( 'y ~ x0 + xl + 0', data)[l] 

0ut[37] : 

DestgnMatrix with shape (5, 2) 
x0 xl 
1 0.01 
2 - 0.01 

3 0.25 

4 -4.10 

5 0.00 
Terms : 

'x0' (column 0) 

'xl' (column 1) 

The Patsy objects can be passed directly into algorithms like numpy.linalg.lstsq, 
which performs an ordinary least squares regression: 

In [38]: coef, resid, _, _ = np.linalg.lstsq(X, y) 

The model metadata is retained in the design_info attribute, so you can reattach the 
model column names to the fitted coefficients to obtain a Series, for example: 

In [39]: coef 
0ut[39] : 

array([[ 0.3129], 

[-0.0791], 

[-0.2655]]) 

In [40]: coef = pd.Serles(coef ,squeeze(), tndex=X.design_info.column_names) 

In [41]: coef 
0ut[41] : 

Intercept 0.312910 

X0 -0.079106 

xl -0.265464 

dtype: float64 
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Data Transformations in Patsy Formulas 

You can mix Python code into your Patsy formulas; when evaluating the formula the 
library will try to find the functions you use in the enclosing scope: 

In [42]: y, X = patsy.dnatrtces( 'y ~ x0 + np.tog(np.abs(xl) + 1)', data) 

In [43]: X 
0ut[43] : 

DesignMatrlx with shape (5, 3) 

Intercept x0 np.tog(np.abs(xl) + 1) 

1 1 0.00995 

1 2 0.00995 

1 3 0.22314 

1 4 1.62924 

1 5 0.00000 

Terms : 

'Intercept' (column 0) 

'x0' (column 1) 

'np.log(np.abs(xl) + 1)' (column 2) 

Some commonly used variable transformations include standardizing (to mean 0 and 
variance 1) and centering (subtracting the mean). Patsy has built-in functions for this 
purpose: 

In [44]: y, X = patsy.dmatrlces( 'y - standardlze(x0) + center(xl)', data) 

In [45]: X 
0ut[45] : 

DesignMatrlx with shape (5, 3) 

Intercept standardize(x0) center(xl) 


1 

-1.41421 

0.78 

1 

-0.70711 

0.76 

1 

0.00000 

1.02 

1 

0.70711 

-3.33 

1 

1.41421 

0.77 


Terms : 

'Intercept' (column 0) 

’standardize^©)' (column 1) 

'center(xl)' (column 2) 

As part of a modeling process, you may fit a model on one dataset, then evaluate the 
model based on another. This might be a hold-out portion or new data that is 
observed later. When applying transformations like center and standardize, you 
should be careful when using the model to form predications based on new data. 
These are called stateful transformations, because you must use statistics like the 
mean or standard deviation of the original dataset when transforming a new dataset. 

The patsy.build_design_matrices function can apply transformations to new out- 
of-sample data using the saved information from the original in-sample dataset: 
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In [46]: new_data = pd.DataFrame({ 

'X0': [6, 7, 8, 9], 

_: 'xl' : [3.1, -0.5, 0, 2.3], 

V: [1, 2, 3, 4]}) 

In [47]: new_X = patsy.build_design_matrices([X.desi.gn_i.nfo] , new_data) 

In [48]: new_X 
0ut[48] : 


[DesignMatrlx 

Intercept 

with shape (4, 3) 
standardize(xO) 

center(xl) 

1 

2.12132 

3.87 

1 

2.82843 

0.27 

1 

3.53553 

0.77 

1 

4.24264 

3.07 


Terns : 

'Intercept' (column 0) 

'standardlze(xO)' (column 1) 

'center(xl)' (column 2)] 

Because the plus symbol (+) in the context of Patsy formulas does not mean addition, 
when you want to add columns from a dataset by name, you must wrap them in the 
special I function: 

In [49]: y, X = patsy.dmatrlces( 'y ~ I(x0 + xl)', data) 

In [50]: X 
Out[50] : 

DesignMatrlx with shape (5, 2) 

Intercept I(x0 + xl) 

1 1.01 

1 1.99 

1 3.25 

1 - 0.10 

1 5.00 

Terms : 

'Intercept' (column 0) 

'I(x0 + xl)' (column 1) 

Patsy has several other built-in transforms in the patsy.builtins module. See the 
online documentation for more. 

Categorical data has a special class of transformations, which I explain next. 

Categorical Data and Patsy 

Non-numeric data can be transformed for a model design matrix in many different 
ways. A complete treatment of this topic is outside the scope of this book and would 
be best studied along with a course in statistics. 
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When you use non-numeric terms in a Patsy formula, they are converted to dummy 
variables by default. If there is an intercept, one of the levels will be left out to avoid 
collinearity: 

In [51]: data = pd.DataFrame({ 

-: ' keyl 1 : ['a', 'a', 'b', 'b', 'a', ' b 1 , 'a', ’ b' ], 

'key2 1 : [0, 1, 0, 1, 0, 1, 0, 0], 

'vl': [1, 2, 3, 4, 5, 6, 7, 8], 

' m 2 ' : [1, 0, 2.5, -0.5, 4.0, -1.2, 0.2, -1.7] 

}) 

In [52]: y, X = patsy.dnatrtces( 'v2 - keyl', data) 

In [53]: X 
Out[53] : 

DesignMatrix with shape (8, 2) 

Intercept keyl[T.b] 

1 0 

1 0 

1 1 

1 1 

1 0 

1 1 

1 0 

1 1 

Terns : 

'Intercept' (column 0) 

'keyl' (column 1) 

If you omit the intercept from the model, then columns for each category value will 
be included in the model design matrix: 

In [54]: y, X = patsy.dmatrices( 'v2 ~ keyl + 0', data) 

In [55]: X 
0ut[55] : 

DesignMatrix with shape (8, 2) 
keyl[a] keyl[b] 

1 0 

1 0 

0 1 

0 1 

1 0 

0 1 

1 0 

0 1 

Terms : 

'keyl' (columns 0:2) 

Numeric columns can be interpreted as categorical with the C function: 

In [56]: y, X = patsy.dmatrices( 'v2 - C(key2)' , data) 
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In [57]: X 
Out[57] : 

DesignMatrix with shape (8, 2) 


Intercept C(key2)[T.l] 


1 

1 

1 

1 

1 

1 

1 

1 


0 

1 

0 

1 

0 

1 

0 

0 


Terns : 

'Intercept' (column 0) 

'C(key2)' (column 1) 

When you’re using multiple categorical terms in a model, things can be more compli¬ 
cated, as you can include interaction terms of the form keyl:key2, which can be 
used, for example, in analysis of variance (ANOVA) models: 

In [58]: data['key2'] = data[' key2' ]. map({0: 'zero', 1: 'one'}) 

In [59]: data 
0ut[59] : 

keyl key2 vl v2 
0 a zero 1 -1.0 

1 a one 2 0.0 

2 b zero 3 2.5 

3 b one 4 -0.5 

4 a zero 5 4.0 

5 b one 6 -1.2 

6 a zero 7 0.2 

7 b zero 8 -1.7 

In [60]: y, X = patsy.dmatrices( 'v2 ~ keyl + key2', data) 

In [61]: X 
0ut[61] : 

DesignMatrix with shape (8, 3) 

Intercept keyl[T.b] key2[T.zero] 


1 

1 

1 

1 

1 

1 

1 

1 


0 

0 

1 

1 

0 

1 

0 

1 


1 

0 

1 

0 

1 

0 

1 

1 


Terms : 


Intercept' (column 0) 
keyl' (column 1) 
key2' (column 2) 
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In [62]: y, X = patsy.dmatrlces( 'v2 ~ keyl + key2 + keyl:key2', data) 


In [63]: X 
0ut[63] : 

DesignMatrlx with shape (8, 4) 

Intercept keyl[T.b] key2[T.zero] 


1 

1 

1 

1 

1 

1 

1 

1 


0 

0 

1 

1 

0 

1 

0 

1 


1 

0 

1 

0 

1 

0 

1 

1 


Terns : 

'Intercept' (column 0) 
'keyl' (column 1) 
'key2' (column 2) 
'keyl:key2' (column 3) 


keyl[T.b] : key2[T.zero] 
0 
0 
1 
0 
0 
0 
0 
1 


Patsy provides for other ways to transform categorical data, including transforma¬ 
tions for terms with a particular ordering. See the online documentation for more. 


13.3 Introduction to statsmodels 

statsmodels is a Python library for fitting many kinds of statistical models, perform¬ 
ing statistical tests, and data exploration and visualization. Statsmodels contains more 
“classical” frequentist statistical methods, while Bayesian methods and machine learn¬ 
ing models are found in other libraries. 

Some kinds of models found in statsmodels include: 


• Linear models, generalized linear models, and robust linear models 

• Linear mixed effects models 

• Analysis of variance (ANOVA) methods 

• Time series processes and state space models 

• Generalized method of moments 


In the next few pages, we will use a few basic tools in statsmodels and explore how to 
use the modeling interfaces with Patsy formulas and pandas DataFrame objects. 

Estimating Linear Models 

There are several kinds of linear regression models in statsmodels, from the more 
basic (e.g., ordinary least squares) to more complex (e.g., iteratively reweighted least 
squares). 
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Linear models in statsmodels have two different main interfaces: array-based and 
formula-based. These are accessed through these API module imports: 

import statsmodels.api as sm 
import statsmodels.formula.api as smf 

To show how to use these, we generate a linear model from some random data: 

def dnorm(mean, variance, size=l): 
if tsinstance(size, int): 
size = size, 

return mean + np.sqrt(variance) * np.random.randn(*size) 

# For reproducibility 
np.random.seed (12345) 


N = 100 

X = np.c_[dnorm(0, 0.4, size=N), 
dnorm(0, 0.6, size=N), 
dnorm(0, 0.2, size=N)] 
eps = dnorm(0, 0.1, size=N) 
beta = [0.1, 0.3, 0.5] 

y = np.dot(X, beta) + eps 

Here, I wrote down the “true” model with known parameters beta. In this case, dnorm 
is a helper function for generating normally distributed data with a particular mean 
and variance. So now we have: 

In [66]: X[:5] 

0ut[66] : 

array( [[ -0. 1295, -1.2128, 0.5042], 

[ 0.3029, -0.4357, -0.2542], 

[-0.3285, -0.0253, 0.1384], 

[-0.3515, -0.7196, -0.2582], 

[ 1.2433, -0.3738, -0.5226]]) 

In [67]: y[:5] 

0ut[67] : array( [ 0.4279, -0.6735, -0.0909, -0.4895, -0.1289]) 

A linear model is generally fitted with an intercept term as we saw before with Patsy. 
The sm. add_constant function can add an intercept column to an existing matrix: 

In [68]: X_model = sm.add_constant(X) 


In [69]: X_model[:5] 
0ut[69] : 

array( [[ 1. , -0.1295, 

[ 1. , 0.3029, 

[ 1. , -0.3285, 

[ 1. , -0.3515, 

[ 1. , 1.2433, 


-1.2128, 0.5042], 

-0.4357, -0.2542], 
-0.0253, 0.1384], 

-0.7196, -0.2582], 
-0.3738, -0.5226]]) 
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The sm. OLS class can fit an ordinary least squares linear regression: 

In [70]: model = sm.OLS(y, X) 

The model’s fit method returns a regression results object containing estimated 
model parameters and other diagnostics: 

In [71]: results = model. fit() 

In [72]: results.params 

0ut[72] : array( [ 0.1783, 0.223 , 0.501 ]) 

The summary method on results can print a model detailing diagnostic output of the 
model: 


In [73]: print(results.summaryQ) 




OLS Regression 

Results 





Dep. Variable: 


y 

R-squared: 


0.430 

Model : 


OLS 

Adj. R-squared: 


0.413 

Method : 


Least Squares 

F-statistic: 


24.42 

Date: 

Mon, 25 Sep 2017 

Prob (F-statistic): 


7.44e-12 

Time: 


14:06:15 

Log-Likelihood: 


-34.305 

No. Observations: 

100 

AIC: 


74.61 

Df Residuals: 


97 

BIC: 


82.42 

Df Model: 


3 




Covariance Type: 

nonrobust 





coef 

std err 

t P>|t| 

[0.025 

0.975] 

xl 

0.1783 

0.053 

3.364 0.001 

0.073 

0.283 

x2 

0.2230 

0.046 

4.818 0.000 

0.131 

0.315 

x3 

0.5010 

0.080 

6.237 0.000 

0.342 

0.660 

Omnibus : 


4.662 

Durbin-Watson : 


2.201 

Prob(Omnibus) : 


0.097 

Jarque-Bera (3B): 


4.098 

Skew: 


0.481 

Prob(JB): 


0.129 

Kurtosis : 


3.243 

Cond. No. 


1.74 


Warnings : 

[1] Standard Errors assume that the covariance matrix of the errors is correctly 
specified. 


The parameter names here have been given the generic names xl, x2, and so on. 
Suppose instead that all of the model parameters are in a DataFrame: 

In [74]: data = pd.DataFrame(X, columns ['col0', 'coll', 'col2']) 

In [75]: data[ 'y' ] = y 

In [76]: data[:5] 

0ut[76] : 

col0 coll col2 y 
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0 -0.129468 -1.212753 0.504225 0.427863 

1 0.302910 -0.435742 -0.254180 -0.673480 

2 -0.328522 -0.025302 0.138351 -0.090878 

3 -0.351475 -0.719605 -0.258215 -0.489494 

4 1.243269 -0.373799 -0.522629 -0.128941 

Now we can use the statsmodels formula API and Patsy formula strings: 

In [77]: results = smf.ols('y ~ col0 + coll + col2', data=data) ,flt() 

In [78]: results.params 
0ut[78] : 

Intercept 0.033559 

col0 0.176149 

coll 0.224826 

col2 0.514808 

dtype: float64 

In [79]: results.tvalues 
0ut[79] : 

Intercept 0.952188 

col0 3.319754 

coll 4.850730 

col2 6.303971 

dtype: float64 

Observe how statsmodels has returned results as Series with the DataFrame column 
names attached. We also do not need to use add_constant when using formulas and 
pandas objects. 

Given new out-of-sample data, you can compute predicted values given the estimated 
model parameters: 

In [80]: results. predlct(data[ :5]) 

Out[80] : 

0 -0.002327 

1 -0.141904 

2 0.041226 

3 -0.323070 

4 -0.100535 
dtype: float64 

There are many additional tools for analysis, diagnostics, and visualization of linear 
model results in statsmodels that you can explore. There are also other kinds of linear 
models beyond ordinary least squares. 

Estimating Time Series Processes 

Another class of models in statsmodels are for time series analysis. Among these are 
autoregressive processes, Kalman filtering and other state space models, and multi¬ 
variate autoregressive models. 
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Let’s simulate some time series data with an autoregressive structure and noise: 
tnit_x = 4 

import random 

values = [lnit_x, init_x] 

N = 1000 


b0 = 0.8 
bl = -0.4 

noise = dnorm(0, 0.1, N) 
for i in range(N): 

new_x = values[-l] * b0 + values[-2] * bl + noisefi] 
values.append(new_x) 

This data has an AR(2) structure (two lags) with parameters 0.8 and -0.4. When you 
fit an AR model, you may not know the number of lagged terms to include, so you 
can fit the model with some larger number of lags: 

In [82]: MAXLAGS = 5 

In [83]: model = sm.tsa.AR(values) 

In [84]: results = model.fit(MAXLAGS) 

The estimated parameters in the results have the intercept first and the estimates for 
the first two lags next: 

In [85]: results.params 

0ut[85] : array([-0.0062, 0.7845, -0.4085, -0.0136, 0.015 , 0.0143]) 

Deeper details of these models and how to interpret their results is beyond what I can 
cover in this book, but there’s plenty more to discover in the statsmodels documenta¬ 
tion. 

13.4 Introduction to scikit-learn 

scikit-learn is one of the most widely used and trusted general-purpose Python 
machine learning toolkits. It contains a broad selection of standard supervised and 
unsupervised machine learning methods with tools for model selection and evalua¬ 
tion, data transformation, data loading, and model persistence. These models can be 
used for classification, clustering, prediction, and other common tasks. 

There are excellent online and printed resources for learning about machine learning 
and how to apply libraries like scikit-learn and TensorFlow to solve real-world prob¬ 
lems. In this section, I will give a brief flavor of the scikit-learn API style. 

At the time of this writing, scikit-learn does not have deep pandas integration, though 
there are some add-on third-party packages that are still in development, pandas can 
be very useful for massaging datasets prior to model fitting, though. 
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As an example, I use a now-classic dataset from a Kaggle competition about passen¬ 
ger survival rates on the Titanic, which sank in 1912. We load the test and training 
dataset using pandas: 

In [86]: train = pd.read_csv( 'datasets/titanic/train.csv' ) 

In [87]: test = pd.read_csv( 1 datasets/titanic/test.csv' ) 


In [88]: train[:4] 
0ut[88] : 



Passengerld 

Survived 1 

3 class \ 



0 

1 

0 

3 



1 

2 

1 

1 



2 

3 

1 

3 



3 

4 

1 

1 






Name 

Sex 

Age 

0 



Braund, Mr. Owen Harris 

male 

22.0 

1 

Cumings, Mrs. 

John Bradley (Florence Briggs Th... 

female 

38.0 

2 



Heikkinen, Miss. Laina 

female 

26.0 

3 

Futrelle 

, Mrs. Jacques Heath (Lily May Peel) 

female 

35.0 


Parch 

Ticket 

Fare Cabin Embarked 



0 

0 

A/5 21171 

7.2500 NaN S 



1 

0 

PC 17599 

71.2833 C85 C 



2 

0 ST0N/02. 3101282 

7.9250 NaN S 



3 

0 

113803 

53.1000 C123 S 




SibSp 

1 

1 

0 

1 


\ 


Libraries like statsmodels and scikit-learn generally cannot be fed missing data, so we 
look at the columns to see if there are any that contain missing data: 


In [89]: train.isnult() ,sun() 


0ut[89] : 

Passengerld 0 
Survived 0 
Pclass 0 
Name 0 
Sex 0 
Age 177 
SibSp 0 
Parch 0 
Ticket 0 
Fare 0 
Cabin 687 
Embarked 2 


dtype: int64 


In [96]: test,isnull( ). sum( ) 


Out[90] : 

Passengerld 0 
Pclass 0 
Name 0 
Sex 0 
Age 86 
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SlbSp 0 
Parch 0 
Ticket 0 
Fare 1 
Cabin 327 
Embarked 0 


dtype: int64 

In statistics and machine learning examples like this one, a typical task is to predict 
whether a passenger would survive based on features in the data. A model is fitted on 
a training dataset and then evaluated on an out-of-sample testing-dataset. 

I would like to use Age as a predictor, but it has missing data. There are a number of 
ways to do missing data imputation, but I will do a simple one and use the median of 
the training dataset to fill the nulls in both tables: 

In [91]: impute_value = train[ 'Age' ] ,median() 

In [92]: train['Age'] = train[ 'Age' ]. fillna(impute_value) 

In [93]: testf'Age'] = test[ 'Age' ]. fillna(impute_value) 

Now we need to specify our models. I add a column Is Female as an encoded version 
of the ' Sex' column: 

In [94]: train[ 'IsFemale' ] = (train[ 'Sex' ] == ' female ') .astype(lnt) 

In [95]: test[' IsFemale' ] = (test['Sex'] == 'female '). astype(int) 

Then we decide on some model variables and create NumPy arrays: 

In [96]: predictors = ['Pclass', 'IsFemale', 'Age'] 

In [97]: X_train = traln[predictors].values 
In [98]: X_test = test[predictors] .values 
In [99]: y_train = train[' Survived ']. values 


In [100]: 
Out[100] : 

X_train[ :5] 

array([[ 

3., 0., 22.], 

[ 

1., 1., 38.], 

[ 

3., 1., 26.], 

[ 

1., 1., 35.], 

[ 

3., 0., 35.]]) 

In [101]: 

y_train[ :5] 

Out[101] : 

array( [0, 1, 1, 1. 


I make no claims that this is a good model nor that these features are engineered 
properly. We use the LogisticRegression model from scikit-learn and create a 
model instance: 
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In [102]: from import LogistlcRegression 

In [103]: model = LogisticRegression( ) 

Similar to statsmodels, we can fit this model to the training data using the model’s fit 
method: 

In [104]: model.fit(X_train, y_train) 

Out[104] : 

LogisticRegression(C=l. 0, class_weight=None, dual=False, fit_intercept=True, 
intercept_scaling=l, max_iter=100, multi_class= 'ovr' , n_jobs=l, 
penalty= '12 1 , random_state=None, solver='llblinear' , tol=0.0001, 
verbose=0, warm_start=False) 

Now, we can form predictions for the test dataset using model. predict: 

In [105]: y_predict = model.predict(X_test) 

In [106]: y_predict[ : 10] 

Out[106] : array( [0, 0, 0, 0, 1, 0, 1, 0, 1, 0]) 

If you had the true values for the test dataset, you could compute an accuracy per¬ 
centage or some other error metric: 

(y_true == y_predtct) .meanQ 

In practice, there are often many additional layers of complexity in model training. 
Many models have parameters that can be tuned, and there are techniques such as 
cross-validation that can be used for parameter tuning to avoid overfitting to the 
training data. This can often yield better predictive performance or robustness on 
new data. 

Cross-validation works by splitting the training data to simulate out-of-sample pre¬ 
diction. Based on a model accuracy score like mean squared error, one can perform a 
grid search on model parameters. Some models, like logistic regression, have estima¬ 
tor classes with built-in cross-validation. For example, the LogisticRegressionCV 
class can be used with a parameter indicating how fine-grained of a grid search to do 
on the model regularization parameter C: 

In [107]: from import LogisticRegressionCV 

In [108]: model_cv = LogisticRegressionCV(10) 

In [109]: model_cv.fit(X_train, y_train) 

Out[109] : 

LogisticRegressionCV(Cs=10, class_weight=None, cv=None, dual=False, 
fit_intercept=True, intercept_scaling=1.0, max_iter=100, 
multi_class= 'ovr' , n_jobs=l, penalty= '12' , random_state=None, 
refit=True, scoring=None, solver= 'Ibfgs' , tol=0.0001, verbose=0) 
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To do cross-validation by hand, you can use the cross_val_score helper function, 
which handles the data splitting process. For example, to cross-validate our model 
with four non-overlapping splits of the training data, we can do: 

In [110]: from inport cross_val_score 

In [ill]: node! = LogisticRegression(C=10) 

In [112]: scores = cross_vat_score(nodel, X_train, y_traln, cv=4) 

In [113]: scores 

0ut[113] : array( [ 0.7723, 0.8027, 0.7703, 0.7883]) 

The default scoring metric is model-dependent, but it is possible to choose an explicit 
scoring function. Cross-validated models take longer to train, but can often yield bet¬ 
ter model performance. 

13.5 Continuing Your Education 

While I have only skimmed the surface of some Python modeling libraries, there are 
more and more frameworks for various kinds of statistics and machine learning 
either implemented in Python or with a Python user interface. 

This book is focused especially on data wrangling, but there are many others dedica¬ 
ted to modeling and data science tools. Some excellent ones are: 

• Introduction to Machine Learning with Python by Andreas Mueller and Sarah 
Guido (O’Reilly) 

• Python Data Science Handbook by Jake VanderPlas (O’Reilly) 

• Data Science from Scratch: First Principles with Python by Joel Grus (O’Reilly) 

• Python Machine Learning by Sebastian Raschka (Packt Publishing) 

• Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurelien 
Geron (O’Reilly) 

While books can be valuable resources for learning, they can sometimes grow out of 
date when the underlying open source software changes. It’s a good idea to be familiar 
with the documentation for the various statistics or machine learning frameworks to 
stay up to date on the latest features and API. 
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CHAPTER 14 


Data Analysis Examples 


Now that we’ve reached the end of this book’s main chapters, we’re going to take a 
look at a number of real-world datasets. For each dataset, we’ll use the techniques 
presented in this book to extract meaning from the raw data. The demonstrated tech¬ 
niques can be applied to all manner of other datasets, including your own. This chap¬ 
ter contains a collection of miscellaneous example datasets that you can use for 
practice with the tools in this book. 

The example datasets are found in the book’s accompanying GitHub repository. 

14.11.USA.gov Data from Bitly 

In 2011, URL shortening service Bitly partnered with the US government website 
USA.gov to provide a feed of anonymous data gathered from users who shorten links 
ending with .gov or .mil. In 2011, a live feed as well as hourly snapshots were available 
as downloadable text files. This service is shut down at the time of this writing (2017), 
but we preserved one of the data files for the book’s examples. 

In the case of the hourly snapshots, each line in each file contains a common form of 
web data known as JSON, which stands for JavaScript Object Notation. For example, 
if we read just the first line of a file we may see something like this: 

In [5]: path = 'datasets/bitly_usagov/example.txt' 

In [6]: open(path). readline( ) 

0ut[6] : '{ "a": "Mozilla\\/5.0 (Windows NT 6.1; W0W64) AppleWebKit\\/535.11 
(KHTML, like Gecko) Chrome\\/17. 0.963.78 Safari\\/535 .11" , "c": "US", "nk”: 1, 
"tz": "America\\/New_York" , "gr": "MA", "g": "A6qOVH", "h": "wfLQtf", "l": 
"orofrog", "al": "en-US,en;q=0.8", "hh": "1.usa.gov", "r": 
"http:\\/\\/www.facebook.com\\/l\\/7AQEFzjSi\\/l.usa.gov\\/wfLQtf" , "u" : 
"http:\\/\\/www.ncbi.nlm.nih.gov\\/pubmed\\/22415991" , "t": 1331923247, "he": 
1331822918, "cy": "Danvers", "U": [ 42.576698, -70.954903 ] }\n ' 
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Python has both built-in and third-party libraries for converting a JSON string into a 
Python dictionary object. Here we’ll use the json module and its loads function 
invoked on each line in the sample file we downloaded: 

import json 

path = 'datasets/bltty_usagov/exampte.txt' 
records = [json.loads(tine) for tine in open(path)] 

The resulting object records is now a list of Python diets: 

In [18]: records[0] 

0ut[18] : 

{'a': 'Moztlla/5.0 (Windows NT 6.1; W0W64) AppteWebKit/535.11 (KHTML, like Gecko) 
Chrome/17. 0.963.78 Safari/535 .11 1 , 

'at 1 : 'en-US,en;q=0.8' , 

'c' : 'US', 

'ey': 'Danvers' , 

'g' : ' A6qOVH' , 

'gr' : ' MA' , 

'h' : ' wfLQtf' , 

'he': 1331822918, 

'hh' : '1.usa.gov' , 

'1' : ' orofrog' , 

'll': [42.576698, -70.954903], 

'nk' : 1, 

'r' : ' http://www.facebook.eom/l/7AQEFzjSi/l.usa.gov/wfLQtf , 

't' : 1331923247, 

'tz' : 'America/New_York' , 

'u' : 'http://www.ncbi.nlm.nih.gov/pubmed/22415991'} 

Counting Time Zones in Pure Python 

Suppose we were interested in finding the most often-occurring time zones in the 
dataset (the tz field). There are many ways we could do this. First, let’s extract a list of 
time zones again using a list comprehension: 

In [12]: time_zones = [recf'tz 1 ] for rec in records] 


KeyError Traceback (most recent call last) 

<ipython-input-12-db4fbd348da9> in <module>() 

-> 1 time_zones = [rec['tz'] for rec in records] 

<ipython-input-12-db4fbd348da9> in <listcomp>( .0) 

-> 1 time_zones = [rec['tz'] for rec in records] 

KeyError: 'tz' 

Oops! Turns out that not all of the records have a time zone field. This is easy to han¬ 
dle, as we can add the check if 1 tz' in rec at the end of the list comprehension: 

In [13]: time_zones = [rec['tz'] for rec in records if 'tz' in rec] 

In [14]: time_zones[ : 10] 

0ut[14] : 
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[ 'Anerica/New_York' , 

1 America/Denver' , 
'Anerica/New_York' , 
'Anerica/Sao_Paulo' , 
'Anerica/New_York' , 
'Anerica/New_York' , 

1 Europe/Warsaw' , 


"] 

Just looking at the first 10 time zones, we see that some of them are unknown (empty 
string). You can filter these out also, but I’ll leave them in for now. Now, to produce 
counts by time zone I’ll show two approaches: the harder way (using just the Python 
standard library) and the easier way (using pandas). One way to do the counting is to 
use a diet to store counts while we iterate through the time zones: 

def get_counts(sequence) : 
counts = {} 
for x in sequence: 
if x in counts: 

counts[x] += 1 
else: 

counts[x] = 1 
return counts 

Using more advanced tools in the Python standard library, you can write the same 
thing more briefly: 

from inport defaultdict 

def get_counts2(sequence) : 

counts = defaultdict(int) # values will initialize to 0 
for x in sequence: 

counts[x] += 1 
return counts 

I put this logic in a function just to make it more reusable. To use it on the time 
zones, just pass the time_zones list: 

In [17]: counts = get_counts(tine_zones) 

In [18]: countsf 'Anerica/New_York' ] 

0ut[18] : 1251 

In [19]: len(time_zones) 

0ut[19] : 3440 

If we wanted the top 10 time zones and their counts, we can do a bit of dictionary 
acrobatics: 

def top_counts(count_dict, n=10): 

value_key_pairs = [(count, tz) for tz, count in count_dict.itens()] 
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value_key_pairs . sort( ) 
return value_key_pairs[ n:] 

We have then: 

In [21]: top_counts(counts) 

0ut[21] : 

[(33, ' America/Sao_Paulo' ), 

(35, 'Europe/Madrid'), 

(36, ' Pactfic/Honolulu' ), 

(37, 'Asia/Tokyo'), 

(74, 'Europe/London'), 

(191, 'America/Denver' ), 

(382, 'America/Los_Angeles' ), 

(400, 'America/Chicago' ), 

(521, "), 

(1251, 'America/New_York' )] 

If you search the Python standard library, you may find the collections.Counter 
class, which makes this task a lot easier: 

In [22]: fron Import Counter 

In [23]: counts = Counter(time_zones) 

In [24]: counts.most_common( 10) 

0ut[24] : 

[( 'America/New_York' , 1251), 

(", 521), 

('America/Chicago', 400), 

( 'America/Los_Angeles' , 382), 

('America/Denver', 191), 

('Europe/London', 74), 

('Asia/Tokyo', 37), 

('Pacific/Honolulu', 36), 

('Europe/Madrid', 35), 

( 'America/Sao_Paulo' , 33)] 

Counting Time Zones with pandas 

Creating a DataFrame from the original set of records is as easy as passing the list of 
records to pandas.DataFrame: 

In [25]: import as 

In [26]: frame = pd.DataFrame(records) 

In [27]: frame.info() 

<class ' pandas .core.frame.DataFrame '> 

Rangelndex: 3560 entries, 0 to 3559 
Data columns (total 18 columns): 

_heartbeat_ 120 non-null float64 
a 3440 non-null object 


406 | Chapter 14: Data Analysis Examples 



at 

3094 

non-null object 

c 

2919 

non-null object 

cy 

2919 

non-null object 

g 

3440 

non-null object 

gr 

2919 

non-null object 

h 

3440 

non-null object 

he 

3440 

non-null float64 

hh 

3440 

non-null object 

kw 

93 non-null object 

l 

3440 

non-null object 

ll 

2919 

non-null object 

nk 

3440 

non-null float64 

r 

3440 

non-null object 

t 

3440 

non-null float64 

tz 

3440 

non-null object 

u 

3440 

non-null object 

dtypes: 

float64(4) , 

object(14) 


memory usage: 500.7+ KB 


In [28]: frame[ 'tz 1 ][: 10] 

0ut[28] : 

0 Amerlca/New_York 

1 America/Denver 

2 Amerlca/New_York 

3 America/Sao_Paulo 

4 Amerlca/New_York 

5 Amerlca/New_York 

6 Europe/Warsaw 

7 

8 
9 

Name: tz, dtype: object 

The output shown for the frame is the summary view, shown for large DataFrame 
objects. We can then use the value_counts method for Series: 

In [29]: tz_counts = frame[' tz' ]. value_counts( ) 

In [30]: tz_counts[ : 10] 


Out[30] : 

Amerlca/New_York 1251 

521 

Amerlca/Chtcago 400 

Amerlca/Los_Angeles 382 

Amertca/Denver 191 

Europe/London 74 

Asia/Tokyo 37 

Pactfic/Honolulu 36 

Europe/Madrid 35 

Amertca/Sao_Paulo 33 


Name: tz, dtype: tnt64 
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We can visualize this data using matplotlib. You can do a bit of munging to fill in a 
substitute value for unknown and missing time zone data in the records. We replace 
the missing values with the fillna method and use boolean array indexing for the 
empty strings: 

In [31]: ctean_tz = frarne[' tz ']. fillna( 'Missing' ) 

In [32]: clean_tz[clean_tz == ''] = 'Unknown' 

In [33]: tz_counts = clean_tz.vatue_counts( ) 

In [34]: tz_counts[ : 10] 

0ut[34] : 

America/New_York 1251 

Unknown 521 

America/Chicago 400 

Amerlca/Los_Angeles 382 

America/Denver 191 

Missing 120 

Europe/London 74 

Asia/Tokyo 37 

Pacific/Honolulu 36 

Europe/Madrid 35 

Name: tz, dtype: int64 

At this point, we can use the seaborn package to make a horizontal bar plot (see 
Figure 14-1 for the resulting visualization): 

In [36]: import as 

In [37]: subset = tz_counts[ :10] 

In [38]: sns.barptot(y=subset.index, x=subset.values) 



Figure 14-1. Top time zones in the 1. usa.gov sample data 
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The a field contains information about the browser, device, or application used to 
perform the URL shortening: 

In [39]: frame[ 'a' ][1] 

0ut[39] : 'GoogleMaps/RochesterNY' 

In [40]: frame[ 'a' ][50] 

Out[40] : 'Mozllla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/20100101 Flrefox/10.0.2' 
In [41]: frame[ 'a' ][51][: 50] # long line 

0ut[41]: 'Mozllla/5.0 (Linux; U; Android 2.2.2; en-us; LG-P9 1 

Parsing all of the interesting information in these “agent” strings may seem like a 
daunting task. One possible strategy is to split off the first token in the string (corre¬ 
sponding roughly to the browser capability) and make another summary of the user 
behavior: 

In [42]: results = pd.Serles([x.spllt()[0] for x In frame.a.dropna()]) 

In [43]: results[:5] 

0ut[43] : 

0 Mozllla/5.0 

1 GoogleMaps/RochesterNY 

2 Mozllla/4.0 

3 Mozllla/5.0 

4 Mozllla/5.0 
dtype: object 

In [44]: results.value_counts()[ :8] 


0ut[44] : 

Mozllla/5.0 2594 
Mozllla/4.0 601 
GoogleMaps/RochesterNY 121 
Opera/9.80 34 
TEST_INTERNET_AGENT 24 
GoogleProducer 21 
Moztlla/6.0 5 
BlackBerry8520/5.0.0.681 4 
dtype: tnt64 


Now, suppose you wanted to decompose the top time zones into Windows and non- 
Windows users. As a simplification, let’s say that a user is on Windows if the string 
1 Windows 1 is in the agent string. Since some of the agents are missing, we’ll exclude 
these from the data: 

In [45]: cframe = frane[frane.a.notnull()] 

We want to then compute a value for whether each row is Windows or not: 

In [47]: cframe[ 1 os 1 ] = np.where(cf rame[ 1 a 1 ]. str.contalns( 1 Windows 1 ), 

....: 'Windows', 'Not Windows') 


In [48]: cframef 'os' ][: 5] 
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Out [48]: 

0 Windows 

1 Not Windows 

2 Windows 

3 Not Windows 

4 Windows 

Name: os, dtype: object 

Then, you can group the data by its time zone column and this new list of operating 
systems: 

In [49]: by_tz_os = cfrane.groupby( [ 'tz 1 , 'os']) 

The group counts, analogous to the value_counts function, can be computed with 
size. This result is then reshaped into a table with unstack: 

In [50]: agg_counts = by_tz_os.size( ). unstack( ). filtna(0) 


In [51]: agg_counts[ :10] 
0ut[51] : 


os 

Not Windows 

Windows 

tz 

245.0 

276.0 

Africa/Cairo 

0.0 

3.0 

Africa/Casablanca 

0.0 

1.0 

Africa/Ceuta 

0.0 

2.0 

Africa/3ohannesburg 

0.0 

1.0 

Africa/Lusaka 

0.0 

1.0 

America/Anchorage 

4.0 

1.0 

Arnerica/Argentina/Buenos_Aires 

1.0 

0.0 

America/Argentina/Cordoba 

0.0 

1.0 

Arnerica/Argentina/Mendoza 

0.0 

1.0 


Finally, let’s select the top overall time zones. To do so, I construct an indirect index 
array from the row counts in agg_counts: 

# Use to sort in ascending order 

In [52]: indexer = agg_counts.sum(l).argsort( ) 


In [53]: indexer[:10] 

0ut[53] : 
tz 

24 

Africa/Cairo 20 

Africa/Casablanca 21 

Africa/Ceuta 92 

Africa/3ohannesburg 87 

Africa/Lusaka 53 

America/Anchorage 54 

America/Argentina/Buenos_Aires 57 
America/Argentina/Cordoba 26 

America/Argentina/Mendoza 55 

dtype: int64 


410 | Chapter 14: Data Analysis Examples 



I use take to select the rows in that order, then slice off the last 10 rows (largest 
values): 

In [54]: count_subset = agg_counts.take(indexer [- 10: ]) 


In [55]: count_subset 
0ut[55] : 


os 

Not Windows 

Windows 

tz 



America/Sao_Paulo 

13.0 

20.0 

Europe/Madrid 

16.0 

19.0 

Pacific/Honolulu 

0.0 

36.0 

Asia/Tokyo 

2.0 

35.0 

Europe/London 

43.0 

31.0 

America/Denver 

132.0 

59.0 

America/Los_Angeles 

130.0 

252.0 

America/Chicago 

115.0 

285.0 


245.0 

276.0 

America/New_York 

339.0 

912.0 


pandas has a convenience method called nlargest that does the same thing: 


In [56]: agg_counts. 
0ut[56] : 
tz 

sum(l). nlargest (10) 

Amerlca/New_York 

1251.0 

521.0 

America/Chicago 

400.0 

Amerlca/Los_Angeles 

382.0 

America/Denver 

191.0 

Europe/London 

74.0 

Asia/Tokyo 

37.0 

Pacific/Honolulu 

36.0 

Europe/Madrid 

35.0 

America/Sao_Paulo 
dtype: float64 

33.0 


Then, as shown in the preceding code block, this can be plotted in a bar plot; I’ll 
make it a stacked bar plot by passing an additional argument to seaborn’s barplot 
function (see Figure 14-2): 

# Rearrange the data for plotting 

In [58]: count_subset = count_subset.stack( ) 

In [59]: count_subset.name = 'total' 

In [60]: count_subset = count_subset.reset_index( ) 

In [61]: count_subset[ : 10] 

0ut[61] : 

tz os total 

0 Amerlca/Sao_Paulo Not Windows 13.0 

1 Amerlca/Sao_Paulo Windows 20.0 
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2 

Europe/Madrid 

Not Windows 

16.0 

3 

Europe/Madrid 

Windows 

19.0 

4 

Pacific/Honolulu 

Not Windows 

0.0 

5 

Pacific/Honolulu 

Windows 

36.0 

6 

Asia/Tokyo 

Not Windows 

2.0 

7 

Asia/Tokyo 

Windows 

35.0 

8 

Europe/London 

Not Windows 

43.0 

9 

Europe/London 

Windows 

31.0 

In 

[62]: sns.barplot(x= 1 total 1 , y= 

'tz' , 1 


America/Sao_Paulo || 
Europe/Madrid | 
Pacific/Honolulu _ 
Asia/Tokyo ^ 
Europe/London 
America/Denver 
America/Los_Angeles 
America/Chicago 

America/New York 


Not Windows 
Windows 



400 600 

mean(total) 


Figure 14-2. Top time zones by Windows and non-Windows users 


The plot doesn’t make it easy to see the relative percentage of Windows users in the 
smaller groups, so let’s normalize the group percentages to sum to 1: 

def norm_total(group) : 

group[ 'normed_total' ] = group.totat / group.total.sum( ) 
return group 

results = count_subset,groupby( 'tz' ). apply(norm_total) 

Then plot this in Figure 14-3: 

In [65]: sns,barplot(x= 'normed_total' , y='tz', hue='os', data=results) 
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America/Sao_Paulo 

Europe/Madrid 

Pacific/Honolulu 

Asia/Tokyo 

Europe/London 

America/Denver 

America/Los_Angeles 

America/Chicago 

America/New_York 

0 


Figure 14-3. Percentage Windows and non-Windows users in top-occurring time zones 

We could have computed the normalized sum more efficiently by using the trans 
form method with groupby: 

In [66]: g = count_subset.groupby( 'tz' ) 

In [67]: results2 = count_subset.total / g.total.transform( 'sum' ) 

14.2 MovieLens 1M Dataset 

GroupLens Research provides a number of collections of movie ratings data collected 
from users of MovieLens in the late 1990s and early 2000s. The data provide movie 
ratings, movie metadata (genres and year), and demographic data about the users 
(age, zip code, gender identification, and occupation). Such data is often of interest in 
the development of recommendation systems based on machine learning algorithms. 
While we do not explore machine learning techniques in detail in this book, I will 
show you how to slice and dice datasets like these into the exact form you need. 

The MovieLens 1M dataset contains 1 million ratings collected from 6,000 users on 
4,000 movies. It’s spread across three tables: ratings, user information, and movie 
information. After extracting the data from the ZIP file, we can load each table into a 
pandas DataFrame object using pandas. read_table: 
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import pandas as pd 


# Make display smaller 

pd.options.display.max_rows = 10 

unames = ['user_id', 'gender', 'age', 'occupation', ’zip'] 
users = pd.read_table( 'datasets/movielens/users.dat' , sep='::', 
header=None, names=unames) 

rnames = ['user_id', 'movie_id', 'rating', 'timestamp'] 
ratings = pd.read_table( 'datasets/movielens/ratings.dat' , sep='::', 
header=None, names=rnames) 


mnames = ['movie_id', 'title', 'genres'] 

movies = pd.read_table( 'datasets/movielens/movies.dat' , sep='::', 
header=None, names=mnames) 

You can verify that everything succeeded by looking at the first few rows of each 
DataFrame with Python’s slice syntax: 


In [69]: users[:5] 
0ut[69] : 



user_id gender 

age 

occupation 

zip 

0 

1 

F 

1 

10 

48067 

1 

2 

M 

56 

16 

70072 

2 

3 

M 

25 

15 

55117 

3 

4 

M 

45 

7 

02460 

4 

5 

M 

25 

20 

55455 


In [70]: ratings[:5] 
Out[70] : 

user_id movie_id 
0 1 1193 

1 1 661 

2 1 914 

3 1 3408 

4 1 2355 


rating timestamp 
5 978300760 

3 978302109 

3 978301968 

4 978300275 

5 978824291 


In [71]: movies[:5] 

0ut[71] : 

movie_id 
0 1 

1 2 

2 3 

3 4 

4 5 Father of the 

In [72]: ratings 
0ut[72] : 

user_id movie_id 
0 1 1193 

1 1 661 

2 1 914 


title 

Toy Story (1995) 
Jumanji (1995) 
Grumpier Old Men (1995) 
Waiting to Exhale (1995) 
Bride Part II (1995) 


rating timestamp 
5 978300760 

3 978302109 

3 978301968 


genres 

Animation|Children 's|Comedy 
Adventure | Children' s|Fantasy 
Comedy | Romance 
Comedy | Drama 
Comedy 
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3 

1 

3408 

4 

978300275 

4 

1 

2355 

5 

978824291 

1000204 

6040 

1091 

1 

956716541 

1000205 

6040 

1094 

5 

956704887 

1000206 

6040 

562 

5 

956704746 

1000207 

6040 

1096 

4 

956715648 

1000208 

6040 

1097 

4 

956715569 


[1000209 rows x 4 columns] 

Note that ages and occupations are coded as integers indicating groups described in 
the datasets README file. Analyzing the data spread across three tables is not a sim¬ 
ple task; for example, suppose you wanted to compute mean ratings for a particular 
movie by sex and age. As you will see, this is much easier to do with all of the data 
merged together into a single table. Using pandas’s merge function, we first merge 
ratings with users and then merge that result with the movies data, pandas infers 
which columns to use as the merge (or join) keys based on overlapping names: 

In [73]: data = pd.merge(pd.merge(ratings, users), movies) 


In [74]: data 
0ut[74] : 



user_id 

movie_id 

rating 

timestamp 

gender 

age 

occupation zip 

0 

1 

1193 


5 

978300760 

F 

1 

10 48067 

1 

2 

1193 


5 

978298413 

M 

56 

16 70072 

2 

12 

1193 


4 

978220179 

M 

25 

12 32793 

3 

15 

1193 


4 

978199279 

M 

25 

7 22903 

4 

17 

1193 


5 

978158471 

M 

50 

1 95350 

1000204 

5949 

2198 


5 

958846401 

M 

18 

17 47901 

1000205 

5675 

2703 


3 

976029116 

M 

35 

14 30030 

1000206 

5780 

2845 


1 

958153068 

M 

18 

17 92886 

1000207 

5851 

3607 


5 

957756608 

F 

18 

20 55410 

1000208 

5938 

2909 


4 

957273353 

M 

25 

1 35401 







title 


genres 

0 

One 

Flew Over 

the 

Cuckoo's Nest 

(1975) 


Drama 

1 

One 

Flew Over 

the 

Cuckoo's Nest 

(1975) 


Drama 

2 

One 

Flew Over 

the 

Cuckoo's Nest 

(1975) 


Drama 

3 

One 

Flew Over 

the 

Cuckoo's Nest 

(1975) 


Drama 

4 

One 

Flew Over 

the 

Cuckoo's Nest 

(1975) 


Drama 

1000204 




Modulations 

(1998) 


Documentary 

1000205 



Broken Vessels 

(1998) 


Drama 

1000206 





White Boys 

(1999) 


Drama 

1000207 



One 

Little Indian 

(1973) 

Comedy|Drama|Western 

1000208 

Five Wives, Three ! 

Secretaries and He 

(1998) 


Documentary 


[1000209 rows x 10 columns] 

In [75]: data.iloc[0] 

0ut[75] : 

user id 1 
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movle_ld 
rating 
timestamp 
gender 
age 

occupation 
zip 

title One Flew Over the Cuckoo's 

genres 

Name: 0, dtype: object 

To get mean movie ratings for each film 
pivot_table method: 


1193 

5 

978300760 

F 

1 

10 
48067 
Nest (1975) 

Drama 

grouped by gender, we can use the 


In [76]: mean_ratlngs = data.plvot_table( 'rating' , lndex= 'title 1 , 

columns='gender' , aggfunc= 'mean' ) 


In [77]: mean_ratlngs[ :5] 

0ut[77] : 

gender F 

title 

$1,000,000 Duck (1971) 3.375000 

'Night Mother (1986) 3.388889 

'Til There Was You (1997) 2.675676 

'burbs. The (1989) 2.793478 


...And Justice for All (1979) 3.828571 


M 

2.761905 

3.352941 

2.733333 

2.962085 

3.689024 


This produced another DataFrame containing mean ratings with movie titles as row 
labels (the “index”) and gender as column labels. I first filter down to movies that 
received at least 250 ratings (a completely arbitrary number); to do this, I then group 
the data by title and use size() to get a Series of group sizes for each title: 


In [78]: ratlngs_by_tltle = data,groupby( 'title '). stze( ) 


In [79]: ratlngs_by_tltle[ : 10] 

0ut[79] : 
title 

$1,000,000 Duck (1971) 37 

'Night Mother (1986) 70 

'Til There Was You (1997) 52 

'burbs. The (1989) 303 

...And Justice for All (1979) 199 

1-900 (1994) 2 

10 Things I Hate About You (1999) 700 

101 Dalmatians (1961) 565 

101 Dalmatians (1996) 364 

12 Angry Men (1957) 616 

dtype: tnt64 


In [80]: actlve_tltles = ratlngs_by_tltle.lndex[ratlngs_by_tltle >= 250] 


In [81]: actlve_tltles 
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Out [81]: 

Index([ 1 'burbs, The (1989)', '10 Things I Hate About You (1999)', 

'101 Dalmatians (1961)', '101 Dalmatians (1996)', '12 Angry Men (1957)', 
'13th Warrior, The (1999)', '2 Days in the Valley (1996)', 

'20,000 Leagues Under the Sea (1954)', '2001: A Space Odyssey (1968)', 
'2010 (1984)', 

'X-Men (2000)', 'Year of Living Dangerously (1982)', 

'Yellow Submarine (1968)', 'You've Got Mail (1998)', 

'Young Frankenstein (1974)', 'Young Guns (1988)', 

'Young Guns II (1990)', 'Young Sherlock Holmes (1985)', 

'Zero Effect (1998)', 'eXistenZ (1999)'], 
dtype='object' , name= 'title' , length=1216) 

The index of titles receiving at least 250 ratings can then be used to select rows from 
mean_ratings: 

# Select rows on the index 

In [82]: mean_ratings = mean_ratings. loc[active_titles] 


In [83]: mean_ratings 
0ut[83] : 


gender 


F 

M 

title 




'burbs. The (1989) 


2.793478 

2.962085 

10 Things I Hate About 

You (1999) 

3.646552 

3.311966 

101 Dalmatians (1961) 


3.791444 

3.500000 

101 Dalmatians (1996) 


3.240000 

2.911215 

12 Angry Men (1957) 


4.184397 

4.328421 

Young Guns (1988) 


3.371795 

3.425620 

Young Guns II (1990) 


2.934783 

2.904025 

Young Sherlock Holmes 

(1985) 

3.514706 

3.363344 

Zero Effect (1998) 


3.864407 

3.723140 

eXistenZ (1999) 


3.098592 

3.289086 


[1216 rows x 2 columns] 

To see the top films among female viewers, we can sort by the F column in descend¬ 
ing order: 

In [85]: top_female_ratings = mean_ratings.sort_values(by= 'F' , ascending=False) 


In [86]: top_female_ratings[ :10] 
0ut[86] : 


gender 


F 

M 

title 




Close Shave, A (1995) 


4.644444 

4.473795 

Wrong Trousers, The (1993) 


4.588235 

4.478261 

Sunset Blvd. (a.k.a. Sunset Boulevard) 

(1950) 

4.572650 

4.464589 

Wallace & Gromit: The Best of Aardman 

Animation.. 

. 4.563107 

4.385075 

Schindler's List (1993) 


4.562602 

4.491415 

Shawshank Redemption, The (1994) 


4.539075 

4.560625 

Grand Day Out, A (1992) 


4.537879 

4.293255 
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To Kill a Mockingbird (1962) 
Creature Comforts (1990) 
Usual Suspects, The (1995) 


4.536667 

4.513889 

4.513317 


4.372611 

4.272277 

4.518248 


Measuring Rating Disagreement 

Suppose you wanted to find the movies that are most divisive between male and 
female viewers. One way is to add a column to rnean_ratings containing the differ¬ 
ence in means, then sort by that: 

In [87]: mean_ratings[ 'diff'] = mean_ratings[' M'] - mean_ratings[ 'F 1 ] 

Sorting by 'diff' yields the movies with the greatest rating difference so that we can 
see which ones were preferred by women: 

In [88]: sorted_by_diff = mean_ratings.sort_values(by=' diff 1 ) 


In [89]: sorted_by_diff [: 10] 
0ut[89] : 


gender 

F 

M diff 

title 



Dirty Dancing (1987) 

3.790378 

2.959596 -0.830782 

Tumpin' Jack Flash (1986) 

3.254717 

2.578358 -0.676359 

Grease (1978) 

3.975265 

3.367041 -0.608224 

Little Women (1994) 

3.870588 

3.321739 -0.548849 

Steel Magnolias (1989) 

3.901734 

3.365957 -0.535777 

Anastasia (1997) 

3.800000 

3.281609 -0.518391 

Rocky Horror Picture Show, The (1975) 

3.673016 

3.160131 -0.512885 

Color Purple, The (1985) 

4.158192 

3.659341 -0.498851 

Age of Innocence, The (1993) 

3.827068 

3.339506 -0.487561 

Free Willy (1993) 

2.921348 

2.438776 -0.482573 


Reversing the order of the rows and again slicing off the top 10 rows, we get the mov¬ 
ies preferred by men that women didn’t rate as highly: 


# Reverse order of rows, take first 16 rows 
In [90]: sorted_by_diff [::-1][: 10] 

Out[90] : 


gender 

F 

M 

diff 

title 




Good, The Bad and The Ugly, The (1966) 

3.494949 

4.221300 

0.726351 

Kentucky Fried Movie, The (1977) 

2.878788 

3.555147 

0.676359 

Dumb & Dumber (1994) 

2.697987 

3.336595 

0.638608 

Longest Day, The (1962) 

3.411765 

4.031447 

0.619682 

Cable Guy, The (1996) 

2.250000 

2.863787 

0.613787 

Evil Dead II (Dead By Dawn) (1987) 

3.297297 

3.909283 

0.611985 

Hidden, The (1987) 

3.137931 

3.745098 

0.607167 

Rocky III (1982) 

2.361702 

2.943503 

0.581801 

Caddyshack (1980) 

3.396135 

3.969737 

0.573602 

For a Few Dollars More (1965) 

3.409091 

3.953795 

0.544704 
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Suppose instead you wanted the movies that elicited the most disagreement among 
viewers, independent of gender identification. Disagreement can be measured by the 
variance or standard deviation of the ratings: 

# Standard deviation of rating grouped by title 

In [91]: rati.ng_std_by_ti.tle = data.groupby( 'title ')[ 1 rating 1 ]. std( ) 

# Filter down to active_titles 

In [92]: ratlng_std_by_tltle = ratlng_std_by_tltle.loc[actlve_tltles] 

# Order Series by value in descending order 

In [93]: ratlng_std_by_tltle.sort_values(ascendlng=False) [: 10] 

0ut[93] : 
title 

Dumb & Dumber (1994) 

Blair Witch Project, The (1999) 

Natural Born Killers (1994) 

Tank Girl (1995) 

Rocky Horror Picture Show, The (1975) 

Eyes Wide Shut (1999) 

Evita (1996) 

Billy Madison (1995) 

Fear and Loathing in Las Vegas (1998) 

Bicentennial Man (1999) 

Name: rating, dtype: float64 

You may have noticed that movie genres are given as a pipe-separated (|) string. If 
you wanted to do some analysis by genre, more work would be required to transform 
the genre information into a more usable form. 

14.3 US Baby Names 1880-2010 

The United States Social Security Administration (SSA) has made available data on 
the frequency of baby names from 1880 through the present. Hadley Wickham, an 
author of several popular R packages, has often made use of this dataset in illustrating 
data manipulation in R. 

We need to do some data wrangling to load this dataset, but once we do that we will 
have a DataFrame that looks like this: 


In [4]: names.head(10) 
0ut[4] : 



name 

sex 

births 

year 

0 

Mary 

F 

7065 

1880 

1 

Anna 

F 

2604 

1880 

2 

Emma 

F 

2003 

1880 

3 

Elizabeth 

F 

1939 

1880 

4 

Minnie 

F 

1746 

1880 

5 

Margaret 

F 

1578 

1880 

6 

Ida 

F 

1472 

1880 


1.321333 

1.316368 

1.307198 

1.277695 

1.260177 

1.259624 

1.253631 

1.249970 

1.246408 

1.245533 
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7 

Alice 

F 

1414 

1880 

8 

Bertha 

F 

1320 

1880 

9 

Sarah 

F 

1288 

1880 


There are many things you might want to do with the dataset: 

• Visualize the proportion of babies given a particular name (your own, or another 
name) over time 

• Determine the relative rank of a name 

• Determine the most popular names in each year or the names whose popularity 
has advanced or declined the most 

• Analyze trends in names: vowels, consonants, length, overall diversity, changes in 
spelling, first and last letters 

• Analyze external sources of trends: biblical names, celebrities, demographic 
changes 

With the tools in this book, many of these kinds of analyses are within reach, so I will 
walk you through some of them. 

As of this writing, the US Social Security Administration makes available data files, 
one per year, containing the total number of births for each sex/name combination. 
The raw archive of these files can be obtained from http://www.ssa.gov/oact/baby 
names/limits, html. 

In the event that this page has been moved by the time you’re reading this, it can most 
likely be located again by an internet search. After downloading the “National data” 
file names.zip and unzipping it, you will have a directory containing a series of files 
like yobl880.txt. I use the Unix head command to look at the first 10 lines of one of 
the files (on Windows, you can use the more command or open it in a text editor): 

In [94]: !head -n 10 datasets/babynames/yobl880.txt 

Mary, F, 7065 

Anna, F, 2604 

Emma, F, 2003 

Elizabeth,F, 1939 

Minnie,F, 1746 

Margaret,F, 1578 

Ida, F, 1472 

Alice,F, 1414 

Bertha,F, 1320 

Sarah,F, 1288 

As this is already in a nicely comma-separated form, it can be loaded into a Data- 
Frame with pandas. read_csv: 
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In [95]: import as 


In [96]: namesl880 = pd.read_csv(' datasets/babynames/yobl880.txt' , 

names=[ 1 name 1 , 'sex', 'births']) 


In [97]: namesl880 
0ut[97] : 



name 

sex 

births 

0 

Mary 

F 

7065 

1 

Anna 

F 

2604 

2 

Emma 

F 

2003 

3 

Elizabeth 

F 

1939 

4 

Minnie 

F 

1746 

1995 

Hoodie 

M 

5 

1996 

Worthy 

M 

5 

1997 

Wright 

M 

5 

1998 

York 

M 

5 

1999 

Zachariah 

M 

5 


[2000 rows x 3 columns] 

These files only contain names with at least five occurrences in each year, so for sim¬ 
plicity’s sake we can use the sum of the births column by sex as the total number of 
births in that year: 

In [98]: namesl880.groupby( 'sex '). births.sum( ) 

0ut[98] : 
sex 

F 90993 

M 110493 

Name: births, dtype: int64 

Since the dataset is split into files by year, one of the first things to do is to assemble 
all of the data into a single DataFrame and further to add a year field. You can do this 
using pandas. concat: 

years = range(1880, 2011) 
pieces = [] 

columns = ['name', 'sex', 'births'] 
for year in years: 

path = 'datasets/babynames/yob%d.txt' % year 
frame = pd.read_csv(path, names=columns) 

f rame[ 1 year 1 ] = year 
pieces.append(frame) 

# Concatenate everything into a single DataFrane 
names = pd.concat(pieces, ignore_index=True) 
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There are a couple things to note here. First, remember that concat glues the Data- 
Frame objects together row-wise by default. Secondly, you have to pass 
ignore_index=True because we’re not interested in preserving the original row num¬ 
bers returned from read_csv. So we now have a very large DataFrame containing all 
of the names data: 


In [100]: names 
Out[100] : 



name 

sex 

births 

year 

0 

Mary 

F 

7065 

1880 

1 

Anna 

F 

2604 

1880 

2 

Emma 

F 

2003 

1880 

3 

Elizabeth 

F 

1939 

1880 

4 

Minnie 

F 

1746 

1880 

1690779 

Zymalre 

M 

5 

2010 

1690780 

Zyonne 

M 

5 

2010 

1690781 

Zyquarlus 

M 

5 

2010 

1690782 

Zyran 

M 

5 

2010 

1690783 

Zzyzx 

M 

5 

2010 


[1690784 rows x 4 columns] 

With this data in hand, we can already start aggregating the data at the year and sex 
level using groupby or pivot_table (see Figure 14-4): 

In [101]: total_blrths = names.pivot_table( 1 births' , lndex= 1 year' , 

.: columns='sex', aggfunc=sum) 


In [102]: total_blrths.tall() 


Out[102] : 


sex 

year 

F 

M 

2006 

1896468 

2050234 

2007 

1916888 

2069242 

2008 

1883645 

2032310 

2009 

1827643 

1973359 

2010 

1759010 

1898382 


In [103]: total_blrths.plot(title= 'Total births by sex and year') 
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Total births by sex and year 



1880 1900 1920 1940 1960 1980 2000 

year 


Figure 14-4. Total births by sex and year 

Next, let’s insert a column prop with the fraction of babies given each name relative to 
the total number of births. A prop value of 0.02 would indicate that 2 out of every 
100 babies were given a particular name. Thus, we group the data by year and sex, 
then add the new column to each group: 

def addprop(group) : 

groupf 'prop 1 ] = group.births / group.births. sum() 
return group 

names = names,groupby( [' year 1 , 'sex' ]) .apply(add_prop) 

The resulting complete dataset now has the following columns: 


In [105]: names 
Out[105] : 



name 

sex 

births 

year 

prop 

0 

Mary 

F 

7065 

1880 

0.077643 

1 

Anna 

F 

2604 

1880 

0.028618 

2 

Emma 

F 

2003 

1880 

0.022013 

3 

Elizabeth 

F 

1939 

1880 

0.021309 

4 

Minnie 

F 

1746 

1880 

0.019188 

1690779 

Zymaire 

M 

5 

2010 

0.000003 

1690780 

Zyonne 

M 

5 

2010 

0.000003 

1690781 

Zyquarius 

M 

5 

2010 

0.000003 
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1690782 Zyran M 5 2010 0.000003 

1690783 Zzyzx M 5 2010 0.000003 

[1690784 rows x 5 columns] 

When performing a group operation like this, it’s often valuable to do a sanity check, 
like verifying that the prop column sums to 1 within all the groups: 

In [106]: names,groupby( [' year' , ' sex 1 ]). prop.sum( ) 


Out[106] : 


year 

sex 


1880 

F 

1.0 


M 

1.0 

1881 

F 

1.0 


M 

1.0 

1882 

F 

1.0 

2008 

M 

1.0 

2009 

F 

1.0 


M 

1.0 

2010 

F 

1.0 


M 

1.0 

Name: 

prop. 

Length: 262, dtype: float64 


Now that this is done, I’m going to extract a subset of the data to facilitate further 
analysis: the top 1,000 names for each sex/year combination. This is yet another 
group operation: 

def get_topl000(group) : 

return group.sort_values(by= 1 births 1 , ascendtng=False) [: 1000] 
grouped = names,groupby( [ 1 year' , 'sex']) 
topl000 = grouped.apply(get_topl000) 

# Drop the group index, not needed 
topl000.reset_lndex(inplace=True, drop=T rue) 

If you prefer a do-it-yourself approach, try this instead: 

pieces = [] 

for year, group in names.groupby( [' year 1 , 'sex']): 

pieces.append(group.sort_values(by=' births' , ascendtng=False) [: 1000] ) 
topl000 = pd.concat(pleces, lgnore_lndex=True) 

The resulting dataset is now quite a bit smaller: 


In [108]: topl000 
Out[108] : 



name 

sex 

births 

year 

prop 

0 

Mary 

F 

7065 

1880 

0.077643 

1 

Anna 

F 

2604 

1880 

0.028618 

2 

Emma 

F 

2003 

1880 

0.022013 

3 

Elizabeth 

F 

1939 

1880 

0.021309 

4 

Minnie 

F 

1746 

1880 

0.019188 

261872 

Camllo 

M 

194 

2010 

0.000102 

261873 

Destln 

M 

194 

2010 

0.000102 
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261874 

laquan 

M 

194 

2010 

0.000102 

261875 

laydan 

M 

194 

2010 

0.000102 

261876 

Maxton 

M 

193 

2010 

0.000102 


[261877 rows x 5 columns] 

We’ll use this Top 1,000 dataset in the following investigations into the data. 

Analyzing Naming Trends 

With the full dataset and Top 1,000 dataset in hand, we can start analyzing various 
naming trends of interest. Splitting the Top 1,000 names into the boy and girl por¬ 
tions is easy to do first: 

In [109]: boys = topl000[topl000.sex == ' M' ] 

In [110]: girls = topl000[topl000.sex == 'F 1 ] 

Simple time series, like the number of Johns or Marys for each year, can be plotted 
but require a bit of munging to be more useful. Let’s form a pivot table of the total 
number of births by year and name: 

In [111]: total_blrths = topl000.plvot_table( 1 births 1 , index=' year' , 

.: columns='name', 

.: aggfunc=sum) 

Now, this can be plotted for a handful of names with DataFrame’s plot method 
(Figure 14-5 shows the result): 

In [112]: total_births.info( ) 

<class ' pandas .core.frame.DataFrame '> 

Int64Index: 131 entries, 1880 to 2010 
Columns: 6868 entries, Aaden to Zuri 
dtypes: float64(6868) 
memory usage: 6.9 MB 

In [113]: subset = total_births[ [' John' , 'Harry', 'Mary', 'Marilyn']] 

In [114]: subset.plot(subplots=True, figsize=(12, 10), grid=False, 

.: title="Number of births per year") 
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Figure 14-5. A few boy and girl names over time 


On looking at this, you might conclude that these names have grown out of favor 
with the American population. But the story is actually more complicated than that, 
as will be explored in the next section. 

Measuring the increase in naming diversity 

One explanation for the decrease in plots is that fewer parents are choosing common 
names for their children. This hypothesis can be explored and confirmed in the data. 
One measure is the proportion of births represented by the top 1,000 most popular 
names, which I aggregate and plot by year and sex (Figure 14-6 shows the resulting 
plot): 
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In [116]: table = topl000.pivot_table( 'prop' , index= 1 year' , 

.: columns='sex', aggfunc=sum) 


In [117]: table.plot(title= 'Sun of tablel000.prop by year and sex', 

.: ytlcks=np.Ilnspace(0, 1.2, 13), xtlcks=range(1880, 2020, 10) 

) 
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Figure 14-6. Proportion of births represented in top 1000 names by sex 


You can see that, indeed, there appears to be increasing name diversity (decreasing 
total proportion in the top 1,000). Another interesting metric is the number of dis¬ 
tinct names, taken in order of popularity from highest to lowest, in the top 50% of 
births. This number is a bit more tricky to compute. Let’s consider just the boy names 
from 2010: 

In [118]: df = boys[boys.year == 2010] 


In [119]: df 
0ut[119] : 



name 

sex 

births 

year 

prop 

260877 

Jacob 

M 

21875 

2010 

0.011523 

260878 

Ethan 

M 

17866 

2010 

0.009411 

260879 

Michael 

M 

17133 

2010 

0.009025 

260880 

Jayden 

M 

17030 

2010 

0.008971 

260881 

William 

M 

16870 

2010 

0.008887 
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261872 

Camilo 

M 

194 

2010 

0.000102 

261873 

Destin 

M 

194 

2010 

0.000102 

261874 

Jaquan 

M 

194 

2010 

0.000102 

261875 

Jaydan 

M 

194 

2010 

0.000102 

261876 

Maxton 

M 

193 

2010 

0.000102 


[1000 rows x 5 columns] 

After sorting prop in descending order, we want to know how many of the most pop¬ 
ular names it takes to reach 50%. You could write a for loop to do this, but a vector¬ 
ized NumPy way is a bit more clever. Taking the cumulative sum, curnsum, of prop and 
then calling the method searchsorted returns the position in the cumulative sum at 
which 0.5 would need to be inserted to keep it in sorted order: 

In [120]: prop_cumsum = df.sort_values(by= 1 prop' , ascendlng=False).prop.cumsum( ) 

In [121]: prop_cumsum[ : 10] 

0ut[121]: 

260877 0.011523 

260878 0.020934 

260879 0.029959 

260880 0.038930 

260881 0.047817 

260882 0.056579 

260883 0.065155 

260884 0.073414 

260885 0.081528 

260886 0.089621 

Name: prop, dtype: float64 

In [122]: prop_cumsum.values.searchsorted(0. 5) 

0ut[122]: 116 

Since arrays are zero-indexed, adding 1 to this result gives you a result of 117. By con¬ 
trast, in 1900 this number was much smaller: 

In [123]: df = boys[boys.year == 1900] 

In [124]: lnl900 = df.sort_values(by= 'prop' , ascending=False).prop.cumsumQ 

In [125]: lnl900.values.searchsorted(0. 5) + 1 
0ut[125]: 25 

You can now apply this operation to each year/sex combination, groupby those fields, 
and apply a function returning the count for each group: 

def get_quantile_count(group, q=0.5): 

group = group.sort_values(by=' prop 1 , ascending=False) 
return group.prop.cumsum( ) .values.searchsorted(q) + 1 

diversity = topl000.groupby( [ 'year' , 'sex' ]). apply(get_quantile_count) 
diversity = diversity.unstack( 'sex' ) 
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This resulting DataFrame diversity now has two time series, one for each sex, 
indexed by year. This can be inspected in IPython and plotted as before (see 
Figure 14-7): 

In [128]: diversity. headQ 
0ut[128] : 
sex F M 
year 

1880 38 14 

1881 38 14 

1882 38 15 

1883 39 15 

1884 39 16 

In [129]: diversity ,plot(title="Number of popular names in top 5054") 


Number of popular names in top 50% 



1880 1900 1920 1940 1960 1980 2000 

year 


Figure 14-7. Plot of diversity metric by year 

As you can see, girl names have always been more diverse than boy names, and they 
have only become more so over time. Further analysis of what exactly is driving the 
diversity, like the increase of alternative spellings, is left to the reader. 
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The "last letter" revolution 


In 2007, baby name researcher Laura Wattenberg pointed out on her website that the 
distribution of boy names by final letter has changed significantly over the last 100 
years. To see this, we first aggregate all of the births in the full dataset by year, sex, and 
final letter: 


# extract last letter from name column 
get_tast_letter = lambda x: x [-1] 
last_letters = names.name.map(get_last_letter) 
last letters.name = 'last letter' 


table = names.pivot_table( 'births' , lndex=last_letters, 

columns=[ 'sex' , 'year'], aggfunc=sum) 

Then we select out three representative years spanning the history and print the first 
few rows: 

In [131]: subtable = table. relndex(columns=[1910, 1960, 2010], level= 'year' ) 


In [132]: subtable.head() 

0ut[132] : 

sex F M 


year 

last_letter 

1910 

1960 

2010 

1910 

1960 

2010 

a 

108376.0 

691247.0 

670605.0 

977.0 

5204.0 

28438.0 

b 

NaN 

694.0 

450.0 

411.0 

3912.0 

38859.0 

c 

5.0 

49.0 

946.0 

482.0 

15476.0 

23125.0 

d 

6750.0 

3729.0 

2607.0 

22111.0 

262112.0 

44398.0 

e 

133569.0 

435013.0 

313833.0 

28655.0 

178823.0 

129012.0 


Next, normalize the table by total births to compute a new table containing propor¬ 
tion of total births for each sex ending in each letter: 

In [133]: subtable.sum() 


0ut[133] : 


sex year 


F 1910 

396416.0 

1960 

2022062.0 

2010 

1759010.0 

M 1910 

194198.0 

1960 

2132588.0 

2010 

1898382.0 

dtype: float64 


In [134]: 

letter_prop 

= subtable 

/ subtable. 

sum( ) 



In [135]: 
0ut[135]: 

letter_prop 






sex 

F 



M 



year 

1910 

1960 

2010 

1910 

1960 

2010 

last_letter 






a 

0.273390 

0.341853 

0.381240 0 

.005031 

0.002440 

0.014980 
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b 

NaN 

0.000343 

0.000256 

0.002116 

0.001834 

0.020470 

c 

0.000013 

0.000024 

0.000538 

0.002482 

0.007257 

0.012181 

d 

0.017028 

0.001844 

0.001482 

0.113858 

0.122908 

0.023387 

e 

0.336941 

0.215133 

0.178415 

0.147556 

0.083853 

0.067959 

V 

NaN 

0.000060 

0.000117 

0.000113 

0.000037 

0.001434 

w 

0.000020 

0.000031 

0.001182 

0.006329 

0.007711 

0.016148 

X 

0.000015 

0.000037 

0.000727 

0.003965 

0.001851 

0.008614 

y 

0.110972 

0.152569 

0.116828 

0.077349 

0.160987 

0.058168 

z 

[26 

0.002439 
rows x 6 columns] 

0.000659 

0.000704 

0.000170 

0.000184 

0.001831 


With the letter proportions now in hand, we can make bar plots for each sex broken 
down by year (see Figure 14-8): 

import matplotlib.pyplot as pit 


fig, axes = plt.subplots(2, 1, figsize=(10, 8)) 
letter_prop[ 'M' ]. plot(kind=' bar' , rot=0, ax=axes[0], title='Male' ) 
letter_prop[ 1 F’ ]. plot(kind=' bar' , rot=0, ax=axes[l], title= 1 Female' , 
legend=False) 



Figure 14-8. Proportion of boy and girl names ending in each letter 
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As you can see, boy names ending in n have experienced significant growth since the 
1960s. Going back to the full table created before, I again normalize by year and sex 
and select a subset of letters for the boy names, finally transposing to make each col¬ 
umn a time series: 

In [138]: tetter_prop = table / table. sum() 

In [139]: dny_ts = letter_prop. toc[ [ 'd' , ' n' , ' y'] , ' M' ]. T 

In [140]: dny_ts.head( ) 

Out [140] : 

last_letter d n y 

year 

1880 0.083055 0.153213 0.075760 

1881 0.083247 0.153214 0.077451 

1882 0.085340 0.149560 0.077537 

1883 0.084066 0.151646 0.079144 

1884 0.086120 0.149915 0.080405 


With this DataFrame of time series in hand, I can make a plot of the trends over time 
again with its plot method (see Figure 14-9): 



Figure 14-9. Proportion of boys born with names ending in dlnly over time 
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Boy names that became girl names (and vice versa) 

Another fun trend is looking at boy names that were more popular with one sex ear¬ 
lier in the sample but have “changed sexes” in the present. One example is the name 
Lesley or Leslie. Going back to the topl000 DataFrame, I compute a list of names 
occurring in the dataset starting with “lesl”: 

In [144]: atl_names = pd.Series(topl000.nane.uniqueQ) 

In [145]: tesley_ltke = all_names[atl_names.str.towerQ•str.contains( 1 lesl' )] 

In [146]: tesley_like 
0ut[146]: 

632 Leslie 

2294 Lesley 

4262 Leslee 

4728 Lesli 

6103 Lesly 

dtype: object 

From there, we can filter down to just those names and sum births grouped by name 
to see the relative frequencies: 

In [147]: filtered = topl000[topl000.name.isin(lesley_like)] 


In [148]: filtered.groupby(' name 1 ). births.sun( ) 

0ut[148] : 
name 

Leslee 1082 

Lesley 35022 

Lesli 929 

Leslie 370429 

Lesly 10067 

Name: births, dtype: int64 

Next, let’s aggregate by sex and year and normalize within year: 

In [149]: table = filtered.pivot_table(' births' , index= 'year 1 , 

.: columns='sex', aggfunc='sun 1 ) 

In [150]: table = table.div(table.sum(l), axis=0) 


In [151]: 
0ut[151]: 
sex F 
year 

2006 1.0 

2007 1.0 

2008 1.0 

2009 1.0 

2010 1.0 


table. tailQ 

M 

NaN 

NaN 

NaN 

NaN 

NaN 
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Lastly, it’s now possible to make a plot of the breakdown by sex over time 
(Figure 14-10): 


In [153]: table.plot(style={ 'M' : 'k-', ' F' : 'k- -' }) 



Figure 14-10. Proportion of male/female Lesley-like names over time 


14.4 USDA Food Database 

The US Department of Agriculture makes available a database of food nutrient infor¬ 
mation. Programmer Ashley Williams made available a version of this database in 
JSON format. The records look like this: 

{ 

"id": 21441, 

"description": "KENTUCKY FRIED CHICKEN, Fried Chicken, EXTRA CRISPY, 

Wing, neat and skin with breading", 

"tags": [ "KFC" ], 

"manufacturer" : "Kentucky Fried Chicken", 

"group": "Fast Foods", 

"portions": [ 

{ 

"amount": 1, 

"unit": "wing, with skin", 

"grams": 68.0 
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}. 


L 

"nutrients": [ 

{ 

"value": 20.8, 

"units": "g", 
"description": "Protein", 
"group": "Composition" 

}. 


] 

} 

Each food has a number of identifying attributes along with two lists of nutrients and 
portion sizes. Data in this form is not particularly amenable to analysis, so we need to 
do some work to wrangle the data into a better form. 

After downloading and extracting the data from the link, you can load it into Python 
with any JSON library of your choosing. I’ll use the built-in Python j son module: 

In [154]: import 

In [155]: db = json . load(open( 'datasets/usda_food/database.json' )) 

In [156]: len(db) 

0ut[156]: 6636 

Each entry in db is a diet containing all the data for a single food. The ' nutrients' 
field is a list of diets, one for each nutrient: 

In [157]: db[0].keys() 

0ut[157]: dict_keys( [ 1 id' , 'description', 'tags', 'manufacturer', 'group', 'porti 
ons ', ' nutrients ']) 

In [158]: db[0] [ 1 nutrients '][0] 

0ut[158]: 

{'description': 'Protein', 

'group': 'Composition', 

'units' : 'g' , 

'value' : 25.18} 

In [159]: nutrients = pd.DataFrame(db[0] [' nutrients ']) 

In [160]: nutrients[ :7] 

Out[160]: 

description group units value 

0 Protein Composition g 25.18 

1 Total lipid (fat) Composition g 29.20 

2 Carbohydrate, by difference Composition g 3.06 

3 Ash Other g 3.28 
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4 Energy Energy kcal 376.00 

5 Water Composition g 39.28 

6 Energy Energy k3 1573.00 

When converting a list of diets to a DataFrame, we can specify a list of fields to 
extract. We’ll take the food names, group, ID, and manufacturer: 

In [161]: info_keys = ['description', 'group', 'id', ’manufacturer'] 

In [162]: info = pd.DataFrame(db, columns=info_keys) 


In [163]: info[:5] 
0ut[163]: 



description 



group 

id 

0 

Cheese, caraway 

Dairy and 

Egg 

Products 

1008 

1 

Cheese, Cheddar 

Dairy and 

Egg 

Products 

1009 

2 

Cheese, edam 

Dairy and 

Egg 

Products 

1018 

3 

Cheese, feta 

Dairy and 

Egg 

Products 

1019 

4 Cheese, 

mozzarella, part skim milk 

Dairy and 

Egg 

Products 

1028 


manufacturer 

0 

1 

2 

3 

4 


In [164]: info.info() 

<class ' pandas .core.frame.DataFrame '> 

Rangelndex: 6636 entries, 0 to 6635 
Data columns (total 4 columns): 
description 6636 non null object 

group 6636 non-null object 

id 6636 non-null int64 

manufacturer 5195 non-null object 

dtypes: int64(l), object(3) 
memory usage: 207.5+ KB 

You can see the distribution of food groups with value_counts: 


In [165]: pd.value_counts(info.group)[:10] 


0ut[165] : 

Vegetables and Vegetable Products 812 
Beef Products 618 
Baked Products 496 
Breakfast Cereals 403 
Fast Foods 365 
Legumes and Legume Products 365 
Lamb, Veal, and Game Products 345 
Sweets 341 
Pork Products 328 
Fruits and Fruit Juices 328 
Name: group, dtype: int64 
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Now, to do some analysis on all of the nutrient data, it’s easiest to assemble the 
nutrients for each food into a single large table. To do so, we need to take several 
steps. First, I’ll convert each list of food nutrients to a DataFrame, add a column for 
the food id, and append the DataFrame to a list. Then, these can be concatenated 
together with concat: 

If all goes well, nutrients should look like this: 


In [167]: nutrients 
0ut[167]: 




description 

group 

units 

value 

id 

0 


Protein 

Composition 

9 

25.180 

1008 

1 


Total lipid (fat) 

Composition 

g 

29.200 

1008 

2 


Carbohydrate, by difference 

Composition 

g 

3.060 

1008 

3 


Ash 

Other 

g 

3.280 

1008 

4 


Energy 

Energy 

kcal 

376.000 

1008 

389350 


Vitamin B-12, added 

Vitamins 

meg 

0.000 

43546 

389351 


Cholesterol 

Other 

mg 

0.000 

43546 

389352 


Fatty acids, total saturated 

Other 

g 

0.072 

43546 

389353 

Fatty 

acids, total monounsaturated 

Other 

g 

0.028 

43546 

389354 

Fatty 

acids, total polyunsaturated 

Other 

g 

0.041 

43546 


[389355 rows x 5 columns] 

I noticed that there are duplicates in this DataFrame, so it makes things easier to drop 
them: 

In [168]: nutrients.duplicated(),sum() # number of duplicates 
0ut[168]: 14179 

In [169]: nutrients = nutrients.drop_duplicates() 

Since 'group' and 'description' are in both DataFrame objects, we can rename for 
clarity: 

In [170]: col_mapping = {'description' : 'food', 

. : ' group' : 'fgroup' } 

In [171]: info = info.rename(columns=col_mapping, copy=False) 

In [172]: info.infoQ 

<class ' pandas .core.frame.DataFrame '> 

Rangelndex: 6636 entries, 0 to 6635 
Data columns (total 4 columns): 
food 6636 non-null object 

fgroup 6636 non-null object 

id 6636 non-null int64 

manufacturer 5195 non-null object 

dtypes: int64(l), object(3) 
memory usage: 207.5+ KB 

In [173]: col_mapping = {'description' : 'nutrient'. 
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.: 1 group' : 'nutgroup' } 

In [174]: nutrients = nutrients.rename(columns=col_mapping, copy=False) 


In [175]: nutrients 
0ut[175]: 




nutrient 

nutgroup 

units 

value 

id 

0 


Protein 

Composition 

9 

25.180 

1008 

1 


Total lipid (fat) 

Composition 

g 

29.200 

1008 

2 


Carbohydrate, by difference 

Composition 

g 

3.060 

1008 

3 


Ash 

Other 

g 

3.280 

1008 

4 


Energy 

Energy 

kcal 

376.000 

1008 

389350 


Vitamin B-12, added 

Vitamins 

meg 

0.000 

43546 

389351 


Cholesterol 

Other 

mg 

0.000 

43546 

389352 


Fatty acids, total saturated 

Other 

g 

0.072 

43546 

389353 

Fatty 

acids, total monounsaturated 

Other 

g 

0.028 

43546 

389354 

Fatty 

acids, total polyunsaturated 

Other 

g 

0.041 

43546 


[375176 rows x 5 columns] 

With all of this done, were ready to merge info with nutrients: 

In [176]: ndata = pd,merge(nutrients, info, on='id', how='outer') 


In [177]: ndata. infoQ 
<class ' pandas .core.frame.DataFrame '> 
Int64Index: 375176 entries, 0 to 375175 
Data columns (total 8 columns): 
nutrient 375176 non-null object 

nutgroup 375176 non-null object 

units 375176 non-null object 

value 375176 non-null float64 

id 375176 non-null int64 

food 375176 non-null object 

fgroup 375176 non-null object 

manufacturer 293054 non-null object 

dtypes: float64(l), int64(l), object(6) 
memory usage: 25.8+ MB 


In [178]: ndata. iloc[30000] 
0ut[178] : 


nutrient Glycine 
nutgroup Amino Acids 
units g 
value 0.04 
id 6158 
food Soup, tomato bisque, canned, condensed 
fgroup Soups, Sauces, and Gravies 
manufacturer 

Name: 30000, dtype: object 


We could now make a plot of median values by food group and nutrient type (see 
Figure 14-11): 
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In [180]: result = ndata.groupby( [' nutrient' , 'fgroup' ])[' value 1 ] .quantile(0.5) 


In [181]: result[ 'Zinc, Zn 1 ]. sort_values( ). plot(klnd= 1 barh' ) 


Beef Products 
Lamb, Veal, and Game Products 
Nut and Seed Products 
Breakfast Cereals 
Spices and Herbs 
Poultry Products 
Pork Products 
Sausages and Luncheon Meats 
Snacks 

Dairy and Egg Products 
Fast Foods 
Legumes and Legume Products 
o Cereal Grains and Pasta 

P Ethnic Foods 

Restaurant Foods 
Finfish and Shellfish Products 
Baked Products 
Meals, Entrees, and Sidedishes 
Baby Foods 
Sweets 

Vegetables and Vegetable Products 
Soups, Sauces, and Gravies 
Fruits and Fruit Juices 
Beverages 
Fats and Oils 



0 1 2 3 4 5 


Figure 14-11. Median zinc values by nutrient group 


With a little cleverness, you can find which food is most dense in each nutrient: 

by_nutrient = ndata.groupby( [' nutgroup 1 , 'nutrient']) 

get_maximum = lambda x: x.loc[x. value. idxmaxQ] 
get_minimum = lambda x: x.loc[x. value. idxmlnO] 

max_foods = by_nutrient.apply(get_maxlmum) [[' value' , 'food']] 

# make the food a little smaller 
max_foods.food = max_foods.food. str[ : 50] 

The resulting DataFrame is a bit too large to display in the book; here is only the 
' Amino Acids 1 nutrient group: 

In [183]: max_foods ,loc[ 'Amino Acids '][' food' ] 

0ut[183]: 
nutrient 
Alanine 
Arginine 
Aspartic acid 
Cystine 
Glutamic acid 

Serine 
Threonine 
Tryptophan 
Tyrosine 


Gelatins, dry powder, unsweetened 
Seeds, sesame flour, low-fat 
Soy protein isolate 
Seeds, cottonseed flour, low fat (glandless) 
Soy protein isolate 

Soy protein isolate, PROTEIN TECHNOLOGIES INTE... 
Soy protein isolate, PROTEIN TECHNOLOGIES INTE... 

Sea lion, Steller, meat with fat (Alaska Native) 
Soy protein isolate, PROTEIN TECHNOLOGIES INTE... 
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Valine Soy protein Isolate, PROTEIN TECHNOLOGIES INTE... 

Name: food. Length: 19, dtype: object 


14.5 2012 Federal Election Commission Database 

The US Federal Election Commission publishes data on contributions to political 
campaigns. This includes contributor names, occupation and employer, address, and 
contribution amount. An interesting dataset is from the 2012 US presidential elec¬ 
tion. A version of the dataset I downloaded in June 2012 is a 150 megabyte CSV file 
POOOOOOOl-ALL.csv (see the book’s data repository), which can be loaded with pan 
das.read_csv: 

In [184]: fee = pd.read_csv( 1 datasets/fec/P00000001-ALL.csv 1 ) 


In [185]: fec.tnfo() 

<class ' pandas. core.frame.DataFrane '> 
Rangelndex: 1001731 entries, 0 to 1001730 
Data columns (total 16 columns): 


cmte_ld 

cand_td 

cand_nm 

contbr_nm 

contbr_clty 

contbr_st 

contbr_zlp 

contbr_employer 

contbr_occupatlon 

contb_recelpt_amt 

contb_recelpt_dt 

recelpt_desc 

memo_cd 

memo_text 

form_tp 

file num 


1001731 non-null object 
1001731 non-null object 
1001731 non-null object 
1001731 non-null object 
1001712 non-null object 
1001727 non-null object 
1001620 non-null object 
988002 non-null object 
993301 non-null object 
1001731 non-null float64 
1001731 non-null object 
14166 non-null object 
92482 non-null object 
97770 non-null object 
1001731 non-null object 
1001731 non-null int64 


dtypes: float64(l), int64(l), object(14) 
memory usage: 122.3+ MB 

A sample record in the DataFrame looks like this: 


In [186]: fec.tloc[123456] 
0ut[186] : 

cmte_id C00431445 

cand id P80003338 


cand_nm 

contbr_nm 

contbr_clty 


Obama, Barack 
ELLMAN, IRA 
TEMPE 


receipt_desc NaN 
memo_cd NaN 
memo_text NaN 
form_tp SA17A 
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file_num 772372 

Name: 123456, Length: 16, dtype: object 

You may think of some ways to start slicing and dicing this data to extract informa¬ 
tive statistics about donors and patterns in the campaign contributions. I’ll show you 
a number of different analyses that apply techniques in this book. 

You can see that there are no political party affiliations in the data, so this would be 
useful to add. You can get a list of all the unique political candidates using unique: 

In [187]: unique_cands = fec.cand_nm.unique() 

In [188]: unique_cands 
0ut[188]: 

array([ 'Bachmann, Michelle', 'Romney, Mitt', 'Obama, Barack', 

"Roemer, Charles E. 'Buddy' III", 'Pawlenty, Timothy', 

'Johnson, Gary Earl', 'Paul, Ron', 'Santorum, Rick', 'Cain, Herman', 
'Gingrich, Newt', 'McCotter, Thaddeus G', 'Huntsman, Jon', 

'Perry, Rick 1 ], dtype=object) 

In [189]: unique_cands[2] 

0ut[189]: 'Obama, Barack' 

One way to indicate party affiliation is using a diet: 1 

parties = {'Bachmann, Michelle': 'Republican', 

'Cain, Herman': 'Republican', 

'Gingrich, Newt': 'Republican', 

’Huntsman, Jon’: ’Republican', 

'Johnson, Gary Earl': 'Republican', 

'McCotter, Thaddeus G': 'Republican', 

'Obama, Barack': 'Democrat', 

'Paul, Ron': 'Republican', 

'Pawlenty, Timothy': 'Republican', 

'Perry, Rick': 'Republican', 

"Roemer, Charles E. 'Buddy' III": ’Republican', 

'Romney, Mitt': 'Republican', 

'Santorum, Rick': 'Republican'} 

Now, using this mapping and the map method on Series objects, you can compute an 
array of political parties from the candidate names: 

In [191]: fee.cand_nm[ 123456 : 123461 ] 

0ut[191]: 

123456 Obama, Barack 

123457 Obama, Barack 

123458 Obama, Barack 

123459 Obama, Barack 

123460 Obama, Barack 


1 This makes the simplifying assumption that Gary Johnson is a Republican even though he later became the 
Libertarian party candidate. 


14.5 2012 Federal Election Commission Database | 441 




Name: cand_nm, dtype: object 

In [192]: fec.cand_nm[123456:123461]. map(parties) 

0ut[192] : 

123456 Democrat 
123457 Democrat 
123458 Democrat 
123459 Democrat 
123460 Democrat 
Name: cand_nm, dtype: object 

# Add it as a column 

In [193]: fec[' party'] = fec.cand_nm.map(parties) 

In [194]: fec[' party ']. value_counts( ) 

0ut[194] : 

Democrat 593746 

Republican 407985 

Name: party, dtype: int64 

A couple of data preparation points. First, this data includes both contributions and 
refunds (negative contribution amount): 

In [195]: (fec.contb_receipt_amt > 0) ,value_counts() 

0ut[195] : 

True 991475 

False 10256 

Name: contb_receipt_amt, dtype: int64 

To simplify the analysis, I’ll restrict the dataset to positive contributions: 

In [196]: fee = fec[fec.contb_recelpt_amt > 0] 

Since Barack Obama and Mitt Romney were the main two candidates, I’ll also pre¬ 
pare a subset that just has contributions to their campaigns: 

In [197]: fec_mrbo = fec[fec.cand_nm.isin([ 'Obama, Barack', 'Romney, Mitt'])] 

Donation Statistics by Occupation and Employer 

Donations by occupation is another oft-studied statistic. For example, lawyers (attor¬ 
neys) tend to donate more money to Democrats, while business executives tend to 
donate more to Republicans. You have no reason to believe me; you can see for your¬ 
self in the data. First, the total number of donations by occupation is easy: 

In [198]: fee.contbr_occupation.value_counts( )[: 10] 


0ut[198]: 

RETIRED 233990 
INFORMATION REQUESTED 35107 
ATTORNEY 34286 
HOMEMAKER 29931 
PHYSICIAN 23432 
INFORMATION REQUESTED PER BEST EFFORTS 21138 
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ENGINEER 

14334 

TEACHER 

13990 

CONSULTANT 

13273 

PROFESSOR 

Name: contbr_occupation, dtype: int64 

12555 

You will notice by looking at the occupations that many refer to the same basic job 
type, or there are several variants of the same thing. The following code snippet illus¬ 
trates a technique for cleaning up a few of them by mapping from one occupation to 
another; note the “trick” of using diet. get to allow occupations with no mapping to 

“pass through”: 


occ_mapping = { 


'INFORMATION REQUESTED PER BEST EFFORTS 

' : 'NOT PROVIDED', 

'INFORMATION REQUESTED' : 'NOT PROVIDED 
'INFORMATION REQUESTED (BEST EFFORTS)' : 

: 'NOT PROVIDED', 

'C.E.O.' : 'CEO' 

} 


# If no mapping provided, return x 
f = lambda x: occ_mapplng.get(x, x) 



fec.contbr_occupation = fec.contbr_occupation.map(f) 

I’ll also do the same thing for employers: 

emp_mapplng = { 

'INFORMATION REQUESTED PER BEST EFFORTS' : 'NOT PROVIDED', 

'INFORMATION REQUESTED' : 'NOT PROVIDED', 

'SELF' : 'SELF-EMPLOYED', 

'SELF EMPLOYED' : 'SELF-EMPLOYED', 

} 

# If no mapping provided, return x 
f = lambda x: emp_mapping.get(x, x) 
fec.contbr_empLoyer = fec.contbr_employer.map(f) 

Now, you can use pivot_table to aggregate the data by party and occupation, then 
filter down to the subset that donated at least $2 million overall: 

In [201]: by_occupatton = fee,ptvot_table( 'contb_receipt_amt' , 

.: index='contbr_occupation', 

.: columns='party', aggfunc='sum') 

In [202]: over_2mm = by_occupation[by_occupation.sum(l) > 2000000] 


In [203]: over_2mm 

Out[203] : 

party 

contbr_occupation 

ATTORNEY 

CEO 

CONSULTANT 

ENGINEER 


Democrat 

11141982.97 

2074974.79 

2459912.71 

951525.55 


Republican 

7.477194e+06 
4.211041e+06 
2.544725e+06 
1.818374e+06 
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EXECUTIVE 

PRESIDENT 
PROFESSOR 
REAL ESTATE 
RETIRED 
SELF-EMPLOYED 


1355161.05 

1878509.95 

2165071.08 

528902.09 

25305116.38 

672393.40 


4.138850e+06 

4.720924e+06 
2.967027e+05 
1.625902e+06 
2.356124e+07 
1.640253e+06 


[17 rows x 2 columns] 


It can be easier to look at this data graphically as a bar plot ( 1 barh 1 means horizontal 
bar plot; see Figure 14-12): 


In [205]: over_2mm.plot(klnd=' barh' ) 


SELF-EMPLOYED 
RETIRED 
REAL ESTATE 
PROFESSOR 
PRESIDENT 
PHYSICIAN 
OWNER 
NOT PROVIDED 
MANAGER 
LAWYER 
INVESTOR 
HOMEMAKER 
EXECUTIVE 
ENGINEER 
CONSULTANT 
CEO 
ATTORNEY 


party 

Democrat 

Republican 


0.0 


0.5 


1.0 


1.5 


2.0 


2.5 


le7 


Figure 14-12. Total donations by party for top occupations 


You might be interested in the top donor occupations or top companies that donated 
to Obama and Romney. To do this, you can group by candidate name and use a var¬ 
iant of the top method from earlier in the chapter: 

def get_top_amounts(group, key, n=5): 

totals = group. groupby(key) [ 1 contb_recelpt_amt' ] .sum() 
return totals.nlargest(n) 

Then aggregate by occupation and employer: 

In [207]: grouped = fec_mrbo.groupby( 'cand_nm' ) 

In [208]: grouped.apply(get_top_amounts, 'contbr_occupatlon 1 , n=7) 

Out[208] : 
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cand_nm contbr_occupatlon 

Obana, Barack RETIRED 
ATTORNEY 

INFORMATION REQUESTED 

HOMEMAKER 

PHYSICIAN 


Romney, Mitt HOMEMAKER 
ATTORNEY 
PRESIDENT 
EXECUTIVE 
C.E.O. 

Name: contb_receipt_amt. 


25305116.38 
11141982.97 
4866973.96 
4248875.80 
3735124.94 

8147446.22 
5364718.82 
2491244.89 
2300947.03 
1968386.11 
Length: 14, dtype: float64 


In [209]: grouped.apply(get_top_amounts. 

, 'contbr_employer' , n=10) 

Out[209] : 



cand_nm 

contbr_empioyer 


Obama, Barack 

RETIRED 

22694358.85 


SELF-EMPLOYED 

17080985.96 


NOT EMPLOYED 

8586308.70 


INFORMATION REQUESTED 

5053480.37 


HOMEMAKER 

2605408.54 

Romney, Mitt 

CREDIT SUISSE 

281150.00 


MORGAN STANLEY 

267266.00 


GOLDMAN SACH & CO. 

238250.00 


BARCLAYS CAPITAL 

162750.00 


H.I.G. CAPITAL 

139500.00 


Name: contb_receipt_amt. Length: 20, dtype: float64 


Bucketing Donation Amounts 

A useful way to analyze this data is to use the cut function to discretize the contribu¬ 
tor amounts into buckets by contribution size: 

In [210]: bins = np.array([0, 1, 10, 100, 1000, 10000, 

.: 100000 , 1000000 , 10000000 ]) 


In [211]: 

Labels = pd 

In [212]: 

labels 

Out[212] : 


411 

(10, 100] 

412 

(100, 1000] 

413 

(100, 1000] 

414 

(10, 100] 

415 

(10, 100] 

701381 

(10, 100] 

701382 

(100, 1000] 

701383 

(1, 10] 

701384 

(10, 100] 
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701385 (100, 1000] 

Name: contb_receipt_amt, Length: 694282, dtype: category 

Categories (8, interval[int64] ): [(0, 1] < (1, 10] < (10, 100] < (100, 1000] < (1 

000 , 10000 ] < 

(10000, 100000] < (100000, 1000000] < (1000000, 


10000000 ]] 


We can then group the data for Obama and Romney by name and bin label to get a 
histogram by donation size: 


In [213]: grouped = fec_nrbo.groupby([ 'cand_nm' , labels]) 


In [214]: grouped.size( ). unstack(0) 
0ut[214] : 


cand_nm 

contb receipt ant 

Obama, Barack 

Romney, Mitt 

(0, 1] 

493.0 

77.0 

(1, 10] 

40070.0 

3681.0 

(10, 100] 

372280.0 

31853.0 

(100, 1000] 

153991.0 

43357.0 

(1000, 10000] 

22284.0 

26186.0 

(10000, 100000] 

2.0 

1.0 

(100000, 1000000] 

3.0 

NaN 

(1000000, 10000000] 

4.0 

NaN 


This data shows that Obama received a significantly larger number of small donations 
than Romney. You can also sum the contribution amounts and normalize within 
buckets to visualize percentage of total donations of each size by candidate 
(Figure 14-13 shows the resulting plot): 

In [216]: bucket_sums = grouped.contb_recelpt_amt.sum().unstack(0) 


In [217]: normed_sums = bucket_sums.div(bucket_sums. sum(axts=l) , axis=0) 


In [218]: normed_suns 



0ut[218] : 



cand_nm 

Obama, Barack 

Romney, Mitt 

contb_receipt_amt 



(0, 1] 

0.805182 

0.194818 

(1, 10] 

0.918767 

0.081233 

(10, 100] 

0.910769 

0.089231 

(100, 1000] 

0.710176 

0.289824 

(1000, 10000] 

0.447326 

0.552674 

(10000, 100000] 

0.823120 

0.176880 

(100000, 1000000] 

1.000000 

NaN 

(1000000, 10000000] 

1.000000 

NaN 


In [219]: normed_sums[ : -2]. plot(kind= 'barh' ) 
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0.0 0.2 0.4 0.6 0.8 


Figure 14-13. Percentage of total donations received by candidates for each donation size 


I excluded the two largest bins as these are not donations by individuals. 

This analysis can be refined and improved in many ways. For example, you could 
aggregate donations by donor name and zip code to adjust for donors who gave many 
small amounts versus one or more large donations. I encourage you to download and 
explore the dataset yourself. 


Donation Statistics by State 

Aggregating the data by candidate and state is a routine affair: 


In [220]: 

grouped = fec_mrbo.groupby( [ 1 cand_nm 1 , 1 

In [221]: 

totals = grouped.contb_receipt_amt.sum() 

In [222]: 

totals = totals[totals . sun(l) > 100000] 

In [223]: 

totalsf : 10] 


0ut[223] : 



cand_nm 

Obama, Barack 

Romney, Mitt 

contbr_st 



AK 

281840.15 

86204.24 

AL 

543123.48 

527303.51 

AR 

359247.28 

105556.00 

AZ 

1506476.98 

1888436.23 

CA 

23824984.24 

11237636.60 

CO 

2132429.49 

1506714.12 

CT 

2068291.26 

3499475.45 
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DC 

4373538.80 

1025137.50 

DE 

336669.14 

82712.00 

FL 

7318178.58 

8338458.81 


If you divide each row by the total contribution amount, you get the relative percent¬ 
age of total donations by state for each candidate: 

In [224]: percent = totals.dtv(totals. sun(l) , axts=0) 


In [225]: percent[:10] 
0ut[225] : 


cand_nm 

contbr_st 

Obama, Barack 

Romney, Mitt 

AK 

0.765778 

0.234222 

AL 

0.507390 

0.492610 

AR 

0.772902 

0.227098 

AZ 

0.443745 

0.556255 

CA 

0.679498 

0.320502 

CO 

0.585970 

0.414030 

CT 

0.371476 

0.628524 

DC 

0.810113 

0.189887 

DE 

0.802776 

0.197224 

FL 

0.467417 

0.532583 


14.6 Conclusion 

We’ve reached the end of the book’s main chapters. I have included some additional 
content you may find useful in the appendixes. 

In the five years since the first edition of this book was published, Python has become 
a popular and widespread language for data analysis. The programming skills you 
have developed here will stay relevant for a long time into the future. I hope the pro¬ 
gramming tools and libraries we’ve explored serve you well in your work. 
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APPENDIX A 


Advanced NumPy 


In this appendix, I will go deeper into the NumPy library for array computing. This 
will include more internal detail about the ndarray type and more advanced array 
manipulations and algorithms. 

This appendix contains miscellaneous topics and does not necessarily need to be read 
linearly. 

A.1 ndarray Object Internals 

The NumPy ndarray provides a means to interpret a block of homogeneous data 
(either contiguous or strided) as a multidimensional array object. The data type, or 
dtype, determines how the data is interpreted as being floating point, integer, boolean, 
or any of the other types we’ve been looking at. 

Part of what makes ndarray flexible is that every array object is a strided view on a 
block of data. You might wonder, for example, how the array view arr[::2, ::-l] 
does not copy any data. The reason is that the ndarray is more than just a chunk of 
memory and a dtype; it also has “striding” information that enables the array to move 
through memory with varying step sizes. More precisely, the ndarray internally con¬ 
sists of the following: 

• A pointer to data —that is, a block of data in RAM or in a memory-mapped file 

• The data type or dtype, describing fixed-size value cells in the array 

• A tuple indicating the array’s shape 

• A tuple of strides, integers indicating the number of bytes to “step” in order to 
advance one element along a dimension 

See Figure A-1 for a simple mockup of the ndarray innards. 
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For example, a 10 x 5 array would have shape (10, 5): 

In [10]: np.ones((10, 5)). shape 
Out[10] : (10, 5) 

A typical (C order) 3x4x5 array of float64 (8-byte) values has strides (160, 40, 
8) (knowing about the strides can be useful because, in general, the larger the strides 
on a particular axis, the more costly it is to perform computation along that axis): 

In [11]: np.ones((3, 4, 5), dtype=np.float64).strides 
Out[ll] : (160, 40, 8) 

While it is rare that a typical NumPy user would be interested in the array strides, 
they are the critical ingredient in constructing “zero-copy” array views. Strides can 
even be negative, which enables an array to move “backward” through memory (this 
would be the case, for example, in a slice like obj [:: -1] or obj [:, ::-!]). 



Figure A-l. The NumPy ndarray object 

NumPy dtype Hierarchy 

You may occasionally have code that needs to check whether an array contains inte¬ 
gers, floating-point numbers, strings, or Python objects. Because there are multiple 
types of floating-point numbers (f loatl6 through floatl28), checking that the dtype 
is among a list of types would be very verbose. Fortunately, the dtypes have super¬ 
classes such as np.integer and np. floating, which can be used in conjunction with 
the np.issubdtype function: 

In [12]: ints = np.ones(10, dtype=np.uintl6) 

In [13]: floats = np.ones(10, dtype=np.float32) 

In [14]: np.issubdtype(ints.dtype, np.integer) 

0ut[14]: True 

In [15]: np.issubdtype(floats.dtype, np.floating) 

0ut[15]: True 

You can see all of the parent classes of a specific dtype by calling the type’s mro 
method: 
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In [16]: np.float64.mro() 
0ut[16] : 

[nunpy.float64, 
nunpy.floating, 
nunpy.Inexact, 
nunpy.nunber, 
nunpy.generic, 
float, 
object] 

Therefore, we also have: 


In [17]: np.issubdtype(ints.dtype, np.nunber) 
0ut[17]: True 


Most NumPy users will never have to know about this, but it occasionally comes in 
handy. See Figure A-2 for a graph of the dtype hierarchy and parent-subclass 
relationships. 1 



Figure A-2. The NumPy dtype class hierarchy 


A.2 Advanced Array Manipulation 

There are many ways to work with arrays beyond fancy indexing, slicing, and boolean 
subsetting. While much of the heavy lifting for data analysis applications is handled 
by higher-level functions in pandas, you may at some point need to write a data algo¬ 
rithm that is not found in one of the existing libraries. 


1 Some of the dtypes have trailing underscores in their names. These are there to avoid variable name conflicts 
between the NumPy-specific types and the Python built-in ones. 
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Reshaping Arrays 

In many cases, you can convert an array from one shape to another without copying 
any data. To do this, pass a tuple indicating the new shape to the reshape array 
instance method. For example, suppose we had a one-dimensional array of values 
that we wished to rearrange into a matrix (the result is shown in Figure A-3): 

In [18]: arr = np.arange(8) 

In [19]: arr 

0ut[19] : array( [0, 1, 2, 3, 4, 5, 6, 7]) 


In [20]: arr. reshape((4, 2)) 
Out[20] : 
array( [[0, 1], 

[2, 3], 

[4, 5], 

[6, 7]]) 


0 

1 

2 

3 

4 

5 

6 

7 

8 

9 

10 

11 


arr.reshape((4, 3), order=?) 


C order (row major) 


0 

1 

2 

3 

4 

5 

6 

7 

8 

9 

10 

11 


Fortran order (column major) 


0 

4 

8 

1 

5 

9 

2 

6 

10 

3 

7 

11 


order='C' 


order= 1 F' 


Figure A-3. Reshaping in C (row major) or Fortran (column major) order 


A multidimensional array can also be reshaped: 

In [21]: arr.reshape((4, 2)). reshape((2, 4)) 

0ut[21] : 

array( [[0, 1, 2, 3], 

[4, 5, 6, 7]]) 

One of the passed shape dimensions can be -1, in which case the value used for that 
dimension will be inferred from the data: 
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In [22]: arr = np.arange(15) 


In [23]: arr.reshape( (5, -1)) 
0ut[23] : 


array( [[ 0, 

1 , 

2], 

[ 3, 

4, 

5 ], 

[ 6, 

7, 

8 ], 

[ 9 , 

10, 

11], 

[12, 

13, 

14]]) 


Since an array’s shape attribute is a tuple, it can be passed to reshape, too: 

In [24]: other_arr = np.ones((3, 5)) 

In [25]: other_arr . shape 
0ut[25] : (3, 5) 

In [26]: arr.reshape(other_arr.shape) 

0ut[26] : 

array( [[ 0, 1, 2, 3, 4], 

[ 5, 6, 7, 8, 9], 

[10, 11, 12, 13, 14]]) 

The opposite operation of reshape from one-dimensional to a higher dimension is 
typically known as flattening or raveling: 

In [27]: arr = np.arange(15).reshape((5, 3)) 


In [28]: arr 
0ut[28] : 


array( [[ 0, 

1, 

2], 

[ 3, 

4, 

5], 

[ 6, 

7, 

8], 

[ 9, 

10, 

11], 

[12, 

13, 

14]]) 


In [29]: arr.ravet() 

0ut[29] : array( [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]) 

ravel does not produce a copy of the underlying values if the values in the result 

were contiguous in the original array. The flatten method behaves like ravel except 
it always returns a copy of the data: 

In [30]: arr.flatten( ) 

Out[30] : array( [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]) 

The data can be reshaped or raveled in different orders. This is a slightly nuanced 

topic for new NumPy users and is therefore the next subtopic. 
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C Versus Fortran Order 

NumPy gives you control and flexibility over the layout of your data in memory. By 
default, NumPy arrays are created in row major order. Spatially this means that if you 
have a two-dimensional array of data, the items in each row of the array are stored in 
adjacent memory locations. The alternative to row major ordering is column major 
order, which means that values within each column of data are stored in adjacent 
memory locations. 

For historical reasons, row and column major order are also know as C and Fortran 
order, respectively. In the FORTRAN 77 language, matrices are all column major. 

Functions like reshape and ravel accept an order argument indicating the order to 
use the data in the array. This is usually set to ' C' or ' F' in most cases (there are also 
less commonly used options ' A 1 and ' K'; see the NumPy documentation, and refer 
back to Figure A-3 for an illustration of these options): 

In [31]: arr = np.arange(12).reshape((3, 4)) 

In [32]: arr 
0ut[32] : 

array([[ 0, 1, 2, 3], 

[ 4, 5, 6, 7], 

[ 8, 9, 10, 11]]) 


In [33]: arr.ravelQ 
0ut[33]: array([ 0, 1, 

2, 

3, 

4, 

5, 

6, 

7, 

8, 

9, 

10, 

11]) 

In [34]: arr.ravel(' F 1 ) 
0ut[34]: array([ 0, 4, 

8, 

1, 

5, 

9, 

2, 

6, 

10, 

3, 

7, 

11]) 


Reshaping arrays with more than two dimensions can be a bit mind-bending (see 
Figure A-3). The key difference between C and Fortran order is the way in which the 
dimensions are walked: 

Cl row major order 

Traverse higher dimensions/zrsf (e.g., axis 1 before advancing on axis 0). 
Fortran/column major order 

Traverse higher dimensions last (e.g., axis 0 before advancing on axis 1). 

Concatenating and Splitting Arrays 

numpy. concatenate takes a sequence (tuple, list, etc.) of arrays and joins them 
together in order along the input axis: 

In [35]: arrl = np.array( [[1, 2, 3], [4, 5, 6]]) 

In [36]: arr2 = np.array([ [7, 8, 9], [10, 11, 12]]) 
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In [37]: np.concatenate([arrl, arr2], axis=0) 
0ut[37] : 


array( [[ 1, 

2, 

3], 

[ 4, 

5, 

6], 

[ 7, 

8, 

9], 

[10, 

11, 

12]]) 


In [38]: np.concatenate([arrl, arr2], axis=l) 

Out[38] : 

array( [[ 1, 2, 3, 7, 8, 9], 

[ 4, 5, 6, 10, 11, 12]]) 

There are some convenience functions, like vstack and hstack, for common kinds of 
concatenation. The preceding operations could have been expressed as: 

In [39]: np.vstack((arrl, arr2)) 

0ut[39] : 


array( [[ 1, 

2, 

3], 

[ 4, 

5, 

6], 

[ 7, 

8, 

9], 

[10, 

11, 

12]]) 


In [40]: np.hstack((arrl, arr2)) 

Out[40] : 

array([[ 1, 2, 3, 7, 8, 9], 

[ 4, 5, 6, 10, 11, 12]]) 

split, on the other hand, slices apart an array into multiple arrays along an axis: 

In [41]: arr = np.random.randn(5, 2) 

In [42]: arr 

0ut[42] : 

array( [[ -0.2047, 0.4789], 

[-0.5194, -0.5557], 

[ 1.9658, 1.3934], 

[ 0.0929, 0.2817], 

[ 0.769 , 1.2464]]) 

In [43]: first, second, third = np.split(arr, [1, 3]) 

In [44]: first 

0ut[44] : array([[ -0.2047, 0.4789]]) 

In [45]: second 

0ut[45] : 

array( [[ -0.5194, -0.5557], 

[ 1.9658, 1.3934]]) 

In [46]: third 

0ut[46] : 

array( [[ 0.0929, 0.2817], 

[ 0.769 , 1.2464]]) 
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The value [1, 3] passed to np.split indicate the indices at which to split the array 
into pieces. 

See Table A-l for a list of all relevant concatenation and splitting functions, some of 
which are provided only as a convenience of the very general-purpose concatenate. 


Table A-l. Array concatenation functions 


1 Function 

Description I 

concatenate 

Most general function, concatenates collection of arrays along one axis 

vstack, row_stack 

Stack arrays row-wise (along axis 0) 

hstack 

Stack arrays column-wise (along axis 1) 

column_stack 

Like hstack, but converts ID arrays to 2D column vectors first 

dstack 

Stack arrays "depth"-wise (along axis 2) 

split 

Split array at passed locations along a particular axis 

hsplit/vsplit 

Convenience functions for splitting on axis 0 and 1, respectively 


Stacking helpers: r_and c_ 

There are two special objects in the NumPy namespace, r_ and c_, that make stacking 
arrays more concise: 

In [47]: arr = np.arange(6) 

In [48]: arrl = arr.reshape((3, 2)) 

In [49]: arr2 = np.random.randn(3, 2) 

In [50]: np.r_[arrl, arr2] 

Out[50] : 

array( [[ 0. , 1. ], 

[2. ,3. ], 

[4. ,5. ], 

[ 1.0072, -1.2962], 

[ 0.275 , 0.2289], 

[ 1.3529, 0.8864]]) 

In [51]: np. c_[np. r_[arrl, arr2], arr] 

0ut[51] : 


array( [[ 

0. 

1 . 

0. 

1, 

[ 

2. 

3. 

1 . 

L 

[ 

4. 

5. 

2. 

L 

[ 

1.0072, 

-1.2962, 

3. 

1, 

[ 

0.275 , 

0.2289, 

4. 

L 

[ 

1.3529, 

0.8864, 

5. 

]]) 


These additionally can translate slices to arrays: 

In [52]: np.c_[l:6, -10: -5] 

0ut[52] : 
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array([[ 1, -10], 

[ 2, -9], 

[ 3, -8], 

[ 4, -7], 

[ 5 , - 6 ]]) 

See the docstring for more on what you can do with c_ and r_. 

Repeating Elements: tile and repeat 

Two useful tools for repeating or replicating arrays to produce larger arrays are the 
repeat and tile functions, repeat replicates each element in an array some number 
of times, producing a larger array: 

In [53]: arr = np.arange(3) 

In [54]: arr 

0ut[54]: array([0, 1, 2]) 

In [55]: arr.repeat(3) 

0ut[55] : array( [0, 0, 0, 1, 1, 1, 2, 2, 2]) 



The need to replicate or repeat arrays can be less common with 
NumPy than it is with other array programming frameworks like 
MATLAB. One reason for this is that broadcasting often fills this 
need better, which is the subject of the next section. 


By default, if you pass an integer, each element will be repeated that number of times. 
If you pass an array of integers, each element can be repeated a different number of 
times: 

In [56]: arr. repeat([2, 3, 4]) 

0ut[56] : array( [0, 0, 1, 1, 1, 2, 2, 2, 2]) 

Multidimensional arrays can have their elements repeated along a particular axis. 

In [57]: arr = np.random.randn(2, 2) 

In [58]: arr 
0ut[58] : 

array( [[ -2.0016, -0.3718], 

[ 1.669 , -0.4386]]) 

In [59]: arr.repeat(2, axls=0) 

0ut[59] : 

array( [[ -2.0016, -0.3718], 

[-2.0016, -0.3718], 

[ 1.669 , -0.4386], 

[ 1.669 , -0.4386]]) 
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Note that if no axis is passed, the array will be flattened first, which is likely not what 
you want. Similarly, you can pass an array of integers when repeating a multidimen¬ 
sional array to repeat a given slice a different number of times: 

In [60]: arr. repeat([2, 3], axis=0) 

Out[60] : 

array( [[ -2.0016, -0.3718], 

[-2.0016, -0.3718], 

[ 1.669 , -0.4386], 

[ 1.669 , -0.4386], 

[ 1.669 , -0.4386]]) 

In [61]: arr.repeat( [2, 3], axis=l) 

0ut[61] : 

array( [[ -2.0016, -2.0016, -0.3718, -0.3718, -0.3718], 

[ 1.669 , 1.669 , -0.4386, -0.4386, -0.4386]]) 

tile, on the other hand, is a shortcut for stacking copies of an array along an axis. 
Visually you can think of it as being akin to “laying down tiles”: 

In [62]: arr 
0ut[62] : 

array([[-2.0016, -0.3718], 

[ 1.669 , -0.4386]]) 

In [63]: np.tile(arr, 2) 

0ut[63] : 

array( [[ -2.0016, -0.3718, -2.0016, -0.3718], 

[ 1.669 , -0.4386, 1.669 , -0.4386]]) 

The second argument is the number of tiles; with a scalar, the tiling is made row by 
row, rather than column by column. The second argument to tile can be a tuple 
indicating the layout of the “tiling”: 

In [64]: arr 
0ut[64] : 

array( [[ -2.0016, -0.3718], 

[ 1.669 , -0.4386]]) 

In [65]: np.tile(arr, (2, 1)) 

0ut[65] : 

array( [[ -2.0016, -0.3718], 

[ 1.669 , -0.4386], 

[-2.0016, -0.3718], 

[ 1.669 , -0.4386]]) 

In [66]: np.tile(arr, (3, 2)) 

0ut[66] : 

array( [[ -2.0016, -0.3718, -2.0016, -0.3718], 

[ 1.669 , -0.4386, 1.669 , -0.4386], 

[-2.0016, -0.3718, -2.0016, -0.3718], 

[ 1.669 , -0.4386, 1.669 , -0.4386], 
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[-2.0016, -0.3718, -2.0016, -0.3718], 

[ 1.669 , -0.4386, 1.669 , -0.4386]]) 

Fancy Indexing Equivalents: take and put 

As you may recall from Chapter 4, one way to get and set subsets of arrays is by fancy 
indexing using integer arrays: 

In [67]: arr = np.arange(10) * 100 
In [68]: tnds = [7, 1, 2, 6] 

In [69]: arr[tnds] 

0ut[69] : array( [700, 100, 200, 600]) 

There are alternative ndarray methods that are useful in the special case of only mak¬ 
ing a selection on a single axis: 


In [70]: 
Out[70] : 

arr.take(lnds) 

array( [700, 100, 200, 600]) 






In [71]: 

arr.put(inds, 42) 






In [72]: 
0ut[72] : 

arr 

array( [ 0, 42, 42, 300, 400, 

500, 

42, 

42, 

800, 

900]) 

In [73]: 

arr.put(inds, [40, 41, 42, 43]) 






In [74]: 
0ut[74] : 

arr 

array( [ 0, 41, 42, 300, 400, 

500, 

43, 

40, 

800, 

900]) 


To use take along other axes, you can pass the axis keyword: 

In [75]: inds = [2, 0, 2, 1] 

In [76]: arr = np.random.randn(2, 4) 

In [77]: arr 
0ut[77] : 

array( [[ -0.5397, 0.477 , 3.2489, -1.0212], 

[-0.5771, 0.1241, 0.3026, 0.5238]]) 

In [78]: arr.take(inds, axis=l) 

0ut[78] : 

array( [[ 3.2489, -0.5397, 3.2489, 0.477 ], 

[ 0.3026, -0.5771, 0.3026, 0.1241]]) 

put does not accept an axis argument but rather indexes into the flattened (one¬ 
dimensional, C order) version of the array. Thus, when you need to set elements 
using an index array on other axes, it is often easiest to use fancy indexing. 
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A.3 Broadcasting 

Broadcasting describes how arithmetic works between arrays of different shapes. It 
can be a powerful feature, but one that can cause confusion, even for experienced 
users. The simplest example of broadcasting occurs when combining a scalar value 
with an array: 

In [79]: arr = np.arange(5) 

In [80]: arr 

Out[80]: array([0, 1, 2, 3, 4]) 

In [81]: arr * 4 

0ut[81] : array( [ 0, 4, 8, 12, 16]) 

Here we say that the scalar value 4 has been broadcast to all of the other elements in 
the multiplication operation. 

For example, we can demean each column of an array by subtracting the column 
means. In this case, it is very simple: 

In [82]: arr = np.random.randn(4, 3) 

In [83]: arr.mean(0) 

0ut[83] : array([-0.3928, -0.3824, -0.8768]) 

In [84]: demeaned = arr - arr.mean(0) 

In [85]: demeaned 
0ut[85] : 

array( [[ 0.3937, 1.7263, 0.1633], 

[-0.4384, -1.9878, -0.9839], 

[-0.468 , 0.9426, -0.3891], 

[ 0.5126, -0.6811, 1.2097]]) 

In [86]: demeaned.mean(0) 

0ut[86]: array([-0., 0., -0 ]) 

See Figure A-4 for an illustration of this operation. Demeaning the rows as a broad¬ 
cast operation requires a bit more care. Fortunately, broadcasting potentially lower 
dimensional values across any dimension of an array (like subtracting the row means 
from each column of a two-dimensional array) is possible as long as you follow the 
rules. 

This brings us to: 
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The Broadcasting Rule 

Two arrays are compatible for broadcasting if for each trailing dimension (i.e., starting 
from the end) the axis lengths match or if either of the lengths is 1. Broadcasting is 
then performed over the missing or length 1 dimensions. 
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Figure A-4. Broadcasting over axis 0 with a ID array 


Even as an experienced NumPy user, I often find myself having to pause and draw a 
diagram as I think about the broadcasting rule. Consider the last example and sup¬ 
pose we wished instead to subtract the mean value from each row. Since arr. mean(0) 
has length 3, it is compatible for broadcasting across axis 0 because the trailing 
dimension in arr is 3 and therefore matches. According to the rules, to subtract over 
axis 1 (i.e., subtract the row mean from each row), the smaller array must have shape 
(4, 1): 

In [87]: arr 

0ut[87] : 

array( [[ 0.0009, 1.3438, -0.7135], 

[-0.8312, -2.3702, -1.8608], 

[-0.8608, 0.5601, -1.2659], 

[ 0.1198, -1.0635, 0.3329]]) 

In [88]: row_means = arr.mean(l) 

In [89]: row_means . shape 

0ut[89] : (4,) 

In [90]: row_means.reshape((4, 1)) 

Out[90] : 

array([[ 0.2104], 

[-1.6874], 

[-0.5222], 

[-0.2036]]) 
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In [91]: demeaned = arr - row_means.reshape((4, 1)) 

In [92]: demeaned.mean(l) 

Out[92]: array([ 0., -0., 0., 0.]) 


See Figure A-5 for an illustration of this operation. 



Figure A-5. Broadcasting over axis 1 of a 2D array 


See Figure A-6 for another illustration, this time adding a two-dimensional array to a 
three-dimensional one across axis 0. 
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Figure A-6. Broadcasting over axis 0 of a 3D array 


Broadcasting Over Other Axes 

Broadcasting with higher dimensional arrays can seem even more mind-bending, but 
it is really a matter of following the rules. If you don’t, you’ll get an error like this: 
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In [93]: arr - arr.mean(l) 


ValueError Traceback (most recent call last) 

<lpython-input-93-7b87b85a20b2> in <module>() 

-> 1 arr - arr.mean(l) 

ValueError: operands could not be broadcast together with shapes (4,3) (4,) 

It’s quite common to want to perform an arithmetic operation with a lower dimen¬ 
sional array across axes other than axis 0. According to the broadcasting rule, the 
“broadcast dimensions” must be 1 in the smaller array. In the example of row 
demeaning shown here, this meant reshaping the row means to be shape (4, 1) 
instead of (4,): 

In [94]: arr - arr.mean(l).reshape((4, 1)) 

0ut[94] : 

array( [[ -0.2095, 1.1334, -0.9239], 

[ 0.8562, -0.6828, -0.1734], 

[-0.3386, 1.0823, -0.7438], 

[ 0.3234, -0.8599, 0.5365]]) 

In the three-dimensional case, broadcasting over any of the three dimensions is only 
a matter of reshaping the data to be shape-compatible. Figure A-7 nicely visualizes the 
shapes required to broadcast over each axis of a three-dimensional array. 

A common problem, therefore, is needing to add a new axis with length 1 specifically 
for broadcasting purposes. Using reshape is one option, but inserting an axis 
requires constructing a tuple indicating the new shape. This can often be a tedious 
exercise. Thus, NumPy arrays offer a special syntax for inserting new axes by index¬ 
ing. We use the special np. newaxis attribute along with “full” slices to insert the new 
axis: 


In [95]: arr = np.zeros((4, 4)) 

In [96]: arr_3d = arr[:, np.newaxis, :] 

In [97]: arr_3d.shape 
0ut[97] : (4, 1, 4) 

In [98]: arr_ld = np.random.normal(slze=3) 

In [99]: arr_ld[:, np. newaxis] 

0ut[99] : 

array( [[ -2.3594] , 

[-0.1995], 

[-1.542 ]]) 

In [100]: arr_ld[np. newaxis, :] 

Out[100] : array( [[ -2.3594, -0.1995, -1.542 ]]) 
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Full array shape: (8, 5, 3) Axis 2: (8, 5, 1) 



Figure A-7. Compatible 2D array shapes for broadcasting over a 3D array 

Thus, if we had a three-dimensional array and wanted to demean axis 2, say, we 
would need to write: 

In [101]: arr = np.random.randn(3, 4, 5) 

In [102]: depth_means = arr.mean(2) 

In [103]: depth_means 
Out[103] : 

array( [[ -0.4735, 0.3971, -0.0228, 0.2001], 

[-0.3521, -0.281 , -0.071 , -0.1586], 

[ 0.6245, 0.6047, 0.4396, -0.2846]]) 

In [104]: depth_means.shape 
Out[104] : (3, 4) 

In [105]: demeaned = arr - depth_means[ :, :, np.newaxis] 

In [106]: demeaned. mean(2) 

Out[106] : 


array( [[ 

0., 

0., 

-0., 

-0.1, 

[ 

0., 

0., 

-0., 

0.1, 

[ 

0., 

0., 

-0., 

-0.]]) 


You might be wondering if there’s a way to generalize demeaning over an axis without 
sacrificing performance. There is, but it requires some indexing gymnastics: 
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def demean_axis(arr, axis=0): 
means = arr.mean(axis) 

# This generalizes things like [:, np.newaxis] to N dimensions 
Indexer = [sllce(None)] * arr.ndlm 
indexer[axts] = np.newaxis 
return arr - meansflndexer] 


Setting Array Values by Broadcasting 

The same broadcasting rule governing arithmetic operations also applies to setting 
values via array indexing. In a simple case, we can do things like: 

In [107]: arr = np.zeros((4, 3)) 

In [108] : arr[ :] = 5 


In [109]: arr 
Out[109] : 


array( [[ 5., 

5., 

5.1, 

[ 5., 

5., 

5-1, 

[ 5., 

5., 

5.1, 

[ 5., 

5., 

5.]]) 


However, if we had a one-dimensional array of values we wanted to set into the col¬ 
umns of the array, we can do that as long as the shape is compatible: 

In [110]: col = np.array([1.28, -0.42, 0.44, 1.6]) 

In [111]: arr[:] = col[:, np.newaxis] 

In [112]: arr 
0ut[112]: 

array( [[ 1.28, 1.28, 1.28], 

[-0.42, -0.42, -0.42], 

[ 0.44, 0.44, 0.44], 

[ 1.6 , 1.6 , 1.6 ]]) 

In [113]: arr[:2] = [[-1.37], [0.509]] 


In [114]: arr 
0ut[114] : 
array( [ [-1.37 , 
[ 0.509, 
[ 0.44 , 
[ 1-6 , 


-1.37 , -1.37 ], 
0.509, 0.509], 

0.44 , 0.44 ], 

1.6 , 1.6 ]]) 
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A.4 Advanced ufunc Usage 

While many NumPy users will only make use of the fast element-wise operations pro¬ 
vided by the universal functions, there are a number of additional features that occa¬ 
sionally can help you write more concise code without loops. 

ufunc Instance Methods 

Each of NumPy’s binary ufuncs has special methods for performing certain kinds of 
special vectorized operations. These are summarized in Table A-2, but I’ll give a few 
concrete examples to illustrate how they work. 

reduce takes a single array and aggregates its values, optionally along an axis, by per¬ 
forming a sequence of binary operations. For example, an alternative way to sum ele¬ 
ments in an array is to use np. add. reduce: 

In [115]: arr = np.arange(10) 

In [116]: np.add.reduce(arr) 

0ut[116] : 45 

In [117]: arr.sun() 

0ut[117] : 45 

The starting value (0 for add) depends on the ufunc. If an axis is passed, the reduction 
is performed along that axis. This allows you to answer certain kinds of questions in a 
concise way. As a less trivial example, we can use np.logical_and to check whether 
the values in each row of an array are sorted: 

In [118]: np.random. seed(12346) # for reproducibility 

In [119]: arr = np.random.randn(5, 5) 

In [120]: arr[: :2] .sort(l) # sort a few rows 

In [121]: arr[:, :-l] < arr[:, 1:] 

0ut[121] : 

array([[ True, True, True, True], 

[False, True, False, False], 

[ True, True, True, True], 

[ True, False, True, True], 

[ True, True, True, True]], dtype=bool) 

In [122]: np.logical_and.reduce(arr[ :, :-l] < arr[:, 1:], axis=l) 

0ut[122]: array([ True, False, True, False, True], dtype=bool) 

Note that logical_and. reduce is equivalent to the all method. 

accumulate is related to reduce like cumsum is related to sum. It produces an array of 
the same size with the intermediate “accumulated” values: 
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In [123]: arr = np.arange(15).reshape((3, 5)) 


In [124]: np.add.accumulate(arr, axts=l) 

0ut[124] : 

array( [[ 0, 1, 3, 6, 10], 

[ 5, 11, 18, 26, 35], 

[10, 21, 33, 46, 60]]) 

outer performs a pairwise cross-product between two arrays: 

In [125]: arr = np.arange(3).repeat( [1, 2, 2]) 

In [126]: arr 

0ut[126] : array( [0, 1, 1, 2, 2]) 

In [127]: np.multiply.outer(arr, np.arange(5)) 

0ut[127] : 

array([[0, 0, 0, 0, 0], 

[0, 1, 2, 3, 4], 

[0, 1, 2, 3, 4], 

[0, 2, 4, 6, 8], 

[0, 2, 4, 6, 8]]) 

The output of outer will have a dimension that is the sum of the dimensions of the 
inputs: 

In [128]: x, y = np.random.randn(3, 4), np.random.randn(5) 

In [129]: result = np.subtract.outer(x, y) 

In [130]: result.shape 
Out[130] : (3, 4, 5) 

The last method, reduceat, performs a “local reduce,” in essence an array groupby 
operation in which slices of the array are aggregated together. It accepts a sequence of 
“bin edges” that indicate how to split and aggregate the values: 

In [131]: arr = np.arange(10) 

In [132]: np.add.reduceat(arr, [0, 5, 8]) 

0ut[132] : array( [10, 18, 17]) 

The results are the reductions (here, sums) performed over arr[0:5], arr[5:8], and 
arr [8: ]. As with the other methods, you can pass an axis argument: 

In [133]: arr = np.multiply.outer(np.arange(4) , np.arange(5)) 


In [134]: arr 
0ut[134] : 


array( [[ 

0, 

0, 

0, 

0, 

8], 

[ 

0, 

1, 

2, 

3, 

4], 

[ 

0, 

2, 

4, 

6, 

8], 

[ 

0, 

3, 

6, 

9, 

12]]) 
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In [135]: np.add.reduceatfarr, [0, 2, 4], axis=l) 
Out[135] : 


array( [[ 

0, 

0, 

0], 

[ 

1, 

5, 

4], 

[ 

2, 

10, 

8], 

[ 

3, 

15, 

12]]) 


See Table A-2 for a partial listing of ufunc methods. 


Table A-2. ufunc methods 


1 Method 

Description 1 

reduce(x) 

accumulate(x) 

reduceat(x, bins) 

outer(x, y) 

Aggregate values by successive applications of the operation 

Aggregate values, preserving all partial aggregates 

"Local" reduce or "group by"; reduce contiguous slices of data to produce aggregated array 

Apply operation to all pairs of elements in x and y; the resulting array has shape x. shape + 
y.shape 


Writing New ufuncs in Python 

There are a number of facilities for creating your own NumPy ufuncs. The most gen¬ 
eral is to use the NumPy C API, but that is beyond the scope of this book. In this 
section, we will look at pure Python ufuncs. 

numpy. f rompyf unc accepts a Python function along with a specification for the num¬ 
ber of inputs and outputs. For example, a simple function that adds element-wise 
would be specified as: 

In [136]: def add_elements(x, y): 

.: return x + y 

In [137]: add_them = np.frompyfunc(add_elements, 2, 1) 

In [138]: add_then(np.arange(8), np.arange(8)) 

0ut[138]: array([0, 2, 4, 6, 8, 10, 12, 14], dtype=object) 

Functions created using frompyfunc always return arrays of Python objects, which 
can be inconvenient. Fortunately, there is an alternative (but slightly less featureful) 
function, numpy. vectorize, that allows you to specify the output type: 

In [139]: add_them = np.vectorize(add_elements, otypes=[np.float64] ) 

In [140]: add_them(np.arange(8), np.arange(8)) 

Out[140]: array( [ 0., 2., 4., 6., 8., 10., 12., 14.]) 

These functions provide a way to create ufunc-like functions, but they are very slow 
because they require a Python function call to compute each element, which is a lot 
slower than NumPy’s C-based ufunc loops: 
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In [141]: arr = np.random. randn(10000) 


In [142]: %timelt add_them(arr, arr) 

4.12 ms +- 182 us per loop (mean +- std. dev. of 7 runs, 100 loops each) 

In [143]: %timelt np.add(arr, arr) 

6.89 us +- 504 ns per loop (mean +- std. dev. of 7 runs, 100000 loops each) 

Later in this chapter we’ll show how to create fast ufuncs in Python using the Numba 
project. 

A.5 Structured and Record Arrays 

You may have noticed up until now that ndarray is a homogeneous data container; 
that is, it represents a block of memory in which each element takes up the same 
number of bytes, determined by the dtype. On the surface, this would appear to not 
allow you to represent heterogeneous or tabular-like data. A structured array is an 
ndarray in which each element can be thought of as representing a struct in C (hence 
the “structured” name) or a row in a SQL table with multiple named fields: 

In [144]: dtype = [('x', np.float64), (' y' , np.int32)] 

In [145]: sarr = np.array( [(1 . 5, 6), (np.pl, -2)], dtype=dtype) 

In [146]: sarr 
0ut[146] : 

array([( 1.5 , 6), ( 3.1416, -2)], 

dtype=[( 'x' , ' <f8' ), ( 'y 1 , '<14’)]) 

There are several ways to specify a structured dtype (see the online NumPy documen¬ 
tation). One typical way is as a list of tuples with (field_name, field_data_type). 
Now, the elements of the array are tuple-like objects whose elements can be accessed 
like a dictionary: 

In [147]: sarr[0] 

0ut[147] : ( 1.5, 6) 

In [148]: sarr[0][’y'] 

0ut[148] : 6 

The field names are stored in the dtype. names attribute. When you access a field on 
the structured array, a strided view on the data is returned, thus copying nothing: 

In [149]: sarr['x'] 

0ut[149] : array( [ 1.5 , 3.1416]) 

Nested dtypes and Multidimensional Fields 

When specifying a structured dtype, you can additionally pass a shape (as an int or 
tuple): 
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In [150]: dtype = [('x', np.int64, 3), ('y', np.int32)] 


In [151]: arr = np.zeros(4, dtype=dtype) 

In [152]: arr 
Out[152]: 

array([([0, 0, 0], 0), ([0, 0, 0], 0), ([0, 0, 0], 0), ([0, 0, 0], 0)], 
dtype=[('x\ ' <18 ' , (3,)), ('y'» '<14')]) 

In this case, the x field now refers to an array of length 3 for each record: 

In [153]: arr [0][ 'x' ] 

0ut[153]: array([0, 0, 0]) 

Conveniently, accessing arr[ ' x' ] then returns a two-dimensional array instead of a 
one-dimensional array as in prior examples: 

In [154]: arr['x'] 

0ut[154]: 
array( [[0, 0, 0], 

[ 0 , 0 , 0 ], 

[ 0 , 0 , 0 ], 

[ 0 , 0 , 0 ]]) 

This enables you to express more complicated, nested structures as a single block of 
memory in an array. You can also nest dtypes to make more complex structures. Here 
is an example: 

In [155]: dtype = [ ( ' x' , [('a 1 , 'f8'), ( ' b' , ' f 4' ) ]), ('y\ np.int32)] 

In [156]: data = np.array([((l, 2), 5), ((3, 4), 6)], dtype=dtype) 

In [157]: data['x'] 

0ut[157]: 

array([( 1 . , 2.), ( 3., 4.)], 

dtype=[( 1 a' , ' <f8' ), ('b', ' <f4' )] ) 

In [158]: data['y'] 

0ut[158]: array([5, 6], dtype=int32) 

In [159]: data[ 'x' ][’ a' ] 

0ut[159]: array([ 1., 3.]) 

pandas DataFrame does not support this feature directly, though it is similar to hier¬ 
archical indexing. 

Why Use Structured Arrays? 

Compared with, say, a pandas DataFrame, NumPy structured arrays are a compara¬ 
tively low-level tool. They provide a means to interpreting a block of memory as a 
tabular structure with arbitrarily complex nested columns. Since each element in the 
array is represented in memory as a fixed number of bytes, structured arrays provide 
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a very fast and efficient way of writing data to and from disk (including memory 
maps), transporting it over the network, and other such uses. 

As another common use for structured arrays, writing data files as fixed-length 
record byte streams is a common way to serialize data in C and C++ code, which is 
commonly found in legacy systems in industry. As long as the format of the file is 
known (the size of each record and the order, byte size, and data type of each ele¬ 
ment), the data can be read into memory with np. f romf lie. Specialized uses like this 
are beyond the scope of this book, but it’s worth knowing that such things are 
possible. 

A.6 More About Sorting 

Like Python’s built-in list, the ndarray sort instance method is an in-place sort, 
meaning that the array contents are rearranged without producing a new array: 

In [160]: arr = np.random.randn(6) 

In [161]: arr.sort() 

In [162]: arr 

0ut[162] : array( [ - 1.082 , 0.3759, 0.8014, 1.1397, 1.2888, 1.8413]) 

When sorting arrays in-place, remember that if the array is a view on a different 
ndarray, the original array will be modified: 

In [163]: arr = np.random.randn(3, 5) 

In [164]: arr 
0ut[164] : 

array( [[ -0.3318, -1.4711, 0.8705, -0.0847, -1.1329], 

[-1.0111, -0.3436, 2.1714, 0.1234, -0.0189], 

[ 0.1773, 0.7424, 0.8548, 1.038 , -0.329 ]]) 

In [165]: arr[:, 0].sort() # Sort first column values in-place 

In [166]: arr 
0ut[166] : 

array( [[ -1.0111, -1.4711, 0.8705, -0.0847, -1.1329], 

[-0.3318, -0.3436, 2.1714, 0.1234, -0.0189], 

[ 0.1773, 0.7424, 0.8548, 1.038 , -0.329 ]]) 

On the other hand, numpy.sort creates a new, sorted copy of an array. Otherwise, it 
accepts the same arguments (such as kind) as ndarray. sort: 

In [167]: arr = np.random.randn(5) 

In [168]: arr 

0ut[168] : array( [ -1. 1181, -0.2415, -2.0051, 0.7379, -1.0614]) 
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In [169]: np.sort(arr) 

0ut[169] : array( [ 2.0051, -1.1181, -1.0614, -0.2415, 0.7379]) 


In [170]: arr 

Out[170]: array( [- 1. 1181, -0.2415, -2.0051, 0.7379, -1.0614]) 

All of these sort methods take an axis argument for sorting the sections of data along 
the passed axis independently: 

In [171]: arr = np.random.randn(3, 5) 


In [172]: arr 
0ut[172]: 
array([[ 0.5955, 
[-0.3215, 
[ 0.3969, 


-0.2682, 1.3389, 
1.0054, -0.5168, 
-1.7638, 0.6071, 


-0.1872, 0.9111], 

1.1925, -0.1989], 
-0.2222, -0.2171]]) 


In [173]: arr. sort(axis=l) 


In [174]: arr 
0ut[174] : 

array( [[ -0.2682, -0.1872, 0.5955, 0.9111, 1.3389], 

[-0.5168, -0.3215, -0.1989, 1.0054, 1.1925], 

[-1.7638, -0.2222, -0.2171, 0.3969, 0.6071]]) 

You may notice that none of the sort methods have an option to sort in descending 
order. This is a problem in practice because array slicing produces views, thus not 
producing a copy or requiring any computational work. Many Python users are 
familiar with the “trick” that for a list values, values [:: -1] returns a list in reverse 
order. The same is true for ndarrays: 


In [175]: arr[:. 

::-l] 

0ut[175] : 


array([[ 1.3389, 

0.9111 

[ 1.1925, 

1.0054 

[ 0.6071, 

0.3969 


0.5955, -0.1872, -0.2682], 
-0.1989, -0.3215, -0.5168], 
-0.2171, -0.2222, -1.7638]]) 


Indirect Sorts: argsort and lexsort 

In data analysis you may need to reorder datasets by one or more keys. For example, a 
table of data about some students might need to be sorted by last name, then by first 
name. This is an example of an indirect sort, and if you’ve read the pandas-related 
chapters you have already seen many higher-level examples. Given a key or keys (an 
array of values or multiple arrays of values), you wish to obtain an array of integer 
indices (I refer to them colloquially as indexers) that tells you how to reorder the data 
to be in sorted order. Two methods for this are argsort and numpy.lexsort. As an 
example: 

In [176]: values = np.array([5, 0, 1, 3, 2]) 

In [177]: Indexer = values.argsort( ) 
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In [178]: indexer 

Out[178]: array([l, 2, 4, 3 , 0]) 


In [179]: values[indexer] 

Out[179] : array( [0, 1 , 2 , 3, 5]) 

As a more complicated example, this code reorders a two-dimensional array by its 
first row: 


In [180]: arr = np.random.randn(3, 5) 
In [181]: arr[0] = values 


In [182]: arr 
0ut[182]: 

array([[ 5. ,0. ,1. ,3. , 2. ], 

[-0.3636, -0.1378, 2.1777, -0.4728, 0.8356], 
[-0.2089, 0.2316, 0.728 , -1.3918, 1.9956]]) 


In [183]: arr[:, arr[0] .argsortQ] 

0ut[183]: 

array( [[ 0. ,1. ,2. ,3. , 5. ], 

[-0.1378, 2.1777, 0.8356, -0.4728, -0.3636], 

[ 0.2316, 0.728 , 1.9956, -1.3918, -0.2089]]) 


lexsort is similar to argsort, but it performs an indirect lexicographical sort on multi¬ 
ple key arrays. Suppose we wanted to sort some data identified by first and last 
names: 


In [184]: first_name = np.array( [' Bob' , 'lane', 'Steve 1 , 'Bill', 'Barbara']) 


In [185]: last_name = np.array( [' Jones' , 'Arnold', 'Arnold', 'Jones', 'Walters']) 


In [186]: sorter = np.lexsort((first_name, last_name)) 


In [187]: sorter 

0ut[187]: array([l, 2, 3, 0, 4]) 


In [188]: zip(last_name[sorter] , first_name[sorter] ) 

0ut[188]: <zip at 0x7fa203edalc8> 

lexsort can be a bit confusing the first time you use it because the order in which the 
keys are used to order the data starts with the last array passed. Here, last_name was 
used before first name. 



pandas methods like Series’s and DataFrame’s sort_values method 
are implemented with variants of these functions (which also must 
take into account missing values). 
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Alternative Sort Algorithms 

A stable sorting algorithm preserves the relative position of equal elements. This can 
be especially important in indirect sorts where the relative ordering is meaningful: 

In [189]: values = np.array([ '2:first' , '2:second', 'l:first', 'l:second', 

.: '1:third' ]) 

In [190]: key = np.array([2, 2, 1, 1, 1]) 

In [191]: Indexer = key.argsort(kind= 'mergesort 1 ) 

In [192]: Indexer 

0ut[192]: array([2, 3, 4, 0, 1]) 

In [193]: values.take(indexer) 

0ut[193]: 

array([ 1 1:first' , '1:second', 'l:third', '2:flrst', '2:second'], 
dtype=' <U8' ) 

The only stable sort available is mergesort, which has guaranteed 0(n log n) perfor¬ 
mance (for complexity buffs), but its performance is on average worse than the 
default quicksort method. See Table A-3 for a summary of available methods and 
their relative performance (and performance guarantees). This is not something that 
most users will ever have to think about, but it’s useful to know that it’s there. 


Table A-3. Array sorting methods 


1 Kind 

Speed 

Stable 

Work space 

Worst case | 

'quicksort' 

1 

No 

0 

0(n A 2) 

' mergesort ' 

2 

Yes 

n / 2 

0(n log n) 

'heapsort ' 

3 

No 

0 

0(n log n) 


Partially Sorting Arrays 

One of the goals of sorting can be to determine the largest or smallest elements in an 
array. NumPy has optimized methods, numpy.partition and np.argpartition, for 
partitioning an array around the k-th smallest element: 

In [194]: np.random. seed(12345) 

In [195]: arr = np.random. randn(20) 

In [196]: arr 
0ut[196] : 

array([-0.2047, 0.4789, -0.5194, -0.5557, 1.9658, 1.3934, 0.0929, 

0.2817, 0.769 , 1.2464, 1.0072, -1.2962, 0.275 , 0.2289, 

1.3529, 0.8864, -2.0016, -0.3718, 1.669 , -0.4386]) 

In [197]: np.partition(arr, 3) 
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Out [ 197 ]: 

array([-2.0016, -1.2962, -0.5557, -0.5194, -0.3718, -0.4386, -0.2047, 

0.2817, 0.769 , 0.4789, 1.0072, 0.0929, 0.275 , 0.2289, 

1.3529, 0.8864, 1.3934, 1.9658, 1.669 , 1.2464]) 

After you call partition(arr, 3), the first three elements in the result are the small¬ 
est three values in no particular order, numpy.argpartition, similar to numpy.arg 
sort, returns the indices that rearrange the data into the equivalent order: 

In [198]: indices = np.argpartition(arr, 3) 

In [199]: indices 
0ut[199] : 

array( [16, 11, 3, 2, 17, 19, 0, 7, 8, 1, 10, 6, 12, 13, 14, 15, 5, 

4, 18, 9]) 

In [200]: arr.take(indices) 

Out[200] : 

array([-2.0016, -1.2962, -0.5557, -0.5194, -0.3718, -0.4386, -0.2047, 

0.2817, 0.769 , 0.4789, 1.0072, 0.0929, 0.275 , 0.2289, 

1.3529, 0.8864, 1.3934, 1.9658, 1.669 , 1.2464]) 

numpy.searchsorted: Finding Elements in a Sorted Array 

searchsorted is an array method that performs a binary search on a sorted array, 
returning the location in the array where the value would need to be inserted to 
maintain sortedness: 

In [201]: arr = np.array([0, 1, 7, 12, 15]) 

In [202]: arr.searchsorted(9) 

Out[202] : 3 

You can also pass an array of values to get an array of indices back: 

In [203]: arr.searchsorted( [0, 8, 11, 16]) 

Out[203] : array( [0, 3, 3, 5]) 

You might have noticed that searchsorted returned 0 for the 0 element. This is 
because the default behavior is to return the index at the left side of a group of equal 
values: 

In [204]: arr = np.array([0, 0, 0, 1, 1, 1, 1]) 

In [205]: arr.searchsorted( [0, 1]) 

Out[205]: array([0, 3]) 

In [206]: arr.searchsorted( [0, 1], side=' right' ) 

Out[206]: array([3, 7]) 
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As another application of searchsorted, suppose we had an array of values between 
0 and 10,000, and a separate array of “bucket edges” that we wanted to use to bin the 
data: 

In [207]: data = np.floor(np.random.untforn(0, 10000, size=50)) 


In [208]: bins 

In [209]: data 
Out[209] : 

= np.array([0. 

100, 1000, 

5000, 

10000]) 



array([ 9940., 

6768. , 

7908. : 

, 1709., 

268. , 

8003., 

9037., 

246 

4917. , 

5262. , 

5963., 

, 519., 

8950. , 

7282. , 

8183. , 

5002 

8101. , 

959. , 

2189. : 

, 2587., 

4681. , 

4593., 

7095. , 

1780 

5314., 

1677. , 

7688. : 

, 9281., 

6094. , 

1501., 

4896., 

3773 

8486., 

9110. , 

3838. : 

, 3154., 

5683. , 

1878. , 

1258., 

6875 

7996. , 

5039. , 

5735., 9732.. 

2256.]) 

, 6340., 

8884. , 

4954. , 

3516. , 

7142 


To then get a labeling of which interval each data point belongs to (where 1 would 
mean the bucket [0, 100)), we can simply use searchsorted: 

In [210]: labels = bins.searchsorted(data) 

In [211]: labels 
0ut[211] : 

array( [4, 4, 4, 3, 2, 4, 4, 2, 3, 4, 4, 2, 4, 4, 4, 4, 4, 2, 3, 3, 3, 3, 4, 

3, 4, 3, 4, 4, 4, 3, 3, 3, 4, 4, 3, 3, 4, 3, 3, 4, 4, 4, 4, 4, 4, 3, 

3, 4, 4, 3]) 

This, combined with pandas’s groupby, can be used to bin data: 

In [212]: pd.Series(data).groupby(labels).mean() 

0ut[212] : 

2 498.000000 

3 3064.277778 

4 7389.035714 
dtype: float64 

A.7 Writing Fast NumPy Functions with Numba 

Numba is an open source project that creates fast functions for NumPy-like data 
using CPUs, GPUs, or other hardware. It uses the LLVM Project to translate Python 
code into compiled machine code. 

To introduce Numba, let’s consider a pure Python function that computes the expres¬ 
sion (x - y) .mean() using a for loop: 

import numpy as np 

def mean_dtstance(x, y): 
nx = len(x) 
result = 0.0 


476 | Appendix A: Advanced NumPy 



count = 0 

for i in range(nx) : 

result += x[i] - y[i] 
count += 1 

return result / count 

This function is very slow: 

In [209]: x = np.random.randn( 10000000) 

In [210]: y = np.random. randn(10000000) 

In [211]: %timeit mean_dlstance(x, y) 

1 loop, best of 3: 2s per loop 

In [212]: %timeit (x - y).mean() 

100 loops, best of 3: 14.7 ms per loop 

The NumPy version is over 100 times faster. We can turn this function into a com¬ 
piled Numba function using the numba. jit function: 

In [213]: import as 

In [214]: numba_mean_distance = nb.jit(mean_distance) 

We could also have written this as a decorator: 

@nb.jit 

def mean_distance(x, y): 
nx = len(x) 
result = 0.0 
count = 0 

for i in range(nx): 

result += x[i] - y[i] 
count += 1 

return result / count 

The resulting function is actually faster than the vectorized NumPy version: 

In [215]: %timeit numba_mean_distance(x, y) 

100 loops, best of 3: 10.3 ms per loop 

Numba cannot compile arbitrary Python code, but it supports a significant subset of 
pure Python that is most useful for writing numerical algorithms. 

Numba is a deep library, supporting different kinds of hardware, modes of compila¬ 
tion, and user extensions. It is also able to compile a substantial subset of the NumPy 
Python API without explicit for loops. Numba is able to recognize constructs that 
can be compiled to machine code, while substituting calls to the CPython API for 
functions that it does not know how to compile. Numba’s jit function has an option, 
nopython=True, which restricts allowed code to Python code that can be compiled to 
LLVM without any Python C API calls. jit(nopython=True) has a shorter alias 
numba.njit. 
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In the previous example, we could have written: 
from import float64, njlt 

(ftoat64(float64[ :], float64[ :])) 
def mean_distance(x, y): 
return (x - y).mean() 

I encourage you to learn more by reading the online documentation for Numba. The 
next section shows an example of creating custom NumPy ufunc objects. 

Creating Custom numpy.ufunc Objects with Numba 

The numba.vectorize function creates compiled NumPy ufuncs, which behave like 
built-in ufuncs. Let’s consider a Python implementation of numpy. add: 

from import vectorize 

@vectorize 

def nb_add(x, y): 

return x + y 

Now we have: 

In [13]: x = np.arange(10) 

In [14]: nb_add(x, x) 


0ut[14]: array([ 0., 2., 4., 

6., 

8., 

10. , 

12., 

14., 

16., 

18.]) 

In [15]: nb_add.accumulated, 0) 
0ut[15]: array([ 0., 1., 3., 

6., 

10., 

15. , 

21., 

28., 

36., 

45.]) 


A.8 Advanced Array Input and Output 

In Chapter 4, we became acquainted with np.save and np.load for storing arrays in 
binary format on disk. There are a number of additional options to consider for more 
sophisticated use. In particular, memory maps have the additional benefit of enabling 
you to work with datasets that do not fit into RAM. 

Memory-Mapped Files 

A memory-mapped file is a method for interacting with binary data on disk as though 
it is stored in an in-memory array. NumPy implements a memmap object that is 
ndarray-like, enabling small segments of a large file to be read and written without 
reading the whole array into memory. Additionally, a memmap has the same methods 
as an in-memory array and thus can be substituted into many algorithms where an 
ndarray would be expected. 
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To create a new memory map, use the function np. Remap and pass a file path, dtype, 
shape, and file mode: 

In [214]: map = np.memmap( 'mymmap' , dtype='float64' , mode='w+', 

.: shape=(10000, 10000)) 


In [215]: map 
0ut[215] : 


memmapC [[ 9-, 

0., 

0. , ., 

0., 

0., 

0-1, 

[ 0-, 

0., 

0. , ., 

.., 0., 

0., 

0-1, 

[ 0-, 

0., 

0. , ., 

0-, 

0., 

0-1, 

[ o’. 

0., 

0. , ., 

0., 

0., 

0-1, 

[ 0-, 

0., 

0. , ., 

0-, 

0., 

0-1, 

[ 0-, 

0., 

0. , ., 

.., 0., 

0., 

0.11) 


Slicing a memap returns views on the data on disk: 

In [216]: section = map[:5] 

If you assign data to these, it will be buffered in memory (like a Python file object), 
but you can write it to disk by calling flush: 

In [217]: section[:] = np.random.randn(5, 10000) 

In [218]: mmap.flush() 

In [219]: mmap 
0ut[219] : 

memmapC [[ 0.7584, -0.6605, 0.8626, ..., 0.6046, -0.6212, 2.0542], 

[-1.2113, -1.0375, 0.7093, ..., -1.4117, -0.1719, -0.8957], 

[-0.1419, -0.3375, 0.4329, ..., 1.2914, -0.752 , -0.44 ], 


[ 0 - 

, 0 . 

, 0 . 

..., 0 . 

, 0 . 

, 0 . 

1 , 

[ 0 - 

, 0 . 

, 0 . 

..., 0 . 

, 0 . 

, 0 . 

1 , 

[ 0 - 

, 0 . 

, 0 . 

..., 0 . 

, 0 . 

, 0 . 

11 ) 


In [220] : del mmap 

Whenever a memory map falls out of scope and is garbage-collected, any changes will 
be flushed to disk also. When opening an existing memory map, you still have to spec¬ 
ify the dtype and shape, as the file is only a block of binary data with no metadata on 
disk: 

In [221]: mmap = np.memmap( 'mymmap' , dtype='float64' , shape=(10000, 10000)) 


In [222]: mmap 
0ut[222] : 


memmap([[ 0.7584, -0.6605, 

0.8626, .. 

., 0.6046, -0.6212, 

2.0542], 

[-1.2113, -1.0375, 

0.7093, .. 

., -1.4117, 

-0.1719, 

-0.8957], 

[-0.1419, -0.3375, 

0.4329, .. 

., 1.2914, -0.752 , 

-0.44 ], 

• • • > 

[0. ,0. 

0. 

., 0. 

0. 

0. 1, 
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[ 0 . , 0 . , 0 . , .... 0 . , 0 . , 0 . ], 

[ 0 . , 0 . , 0 . , .... 0 . , 0 . , 0 . ]]) 

Memory maps also work with structured or nested dtypes as described in a previous 
section. 

HDF5 and Other Array Storage Options 

PyTables and h5py are two Python projects providing NumPy-friendly interfaces for 
storing array data in the efficient and compressible HDF5 format (HDF stands for 
hierarchical data format). You can safely store hundreds of gigabytes or even terabytes 
of data in HDF5 format. To learn more about using HDF5 with Python, I recommend 
reading the pandas online documentation. 

A.9 Performance Tips 

Getting good performance out of code utilizing NumPy is often straightforward, as 
array operations typically replace otherwise comparatively extremely slow pure 
Python loops. The following list briefly summarizes some things to keep in mind: 

• Convert Python loops and conditional logic to array operations and boolean 
array operations 

• Use broadcasting whenever possible 

• Use arrays views (slicing) to avoid copying data 

• Utilize ufuncs and ufunc methods 

If you can’t get the performance you require after exhausting the capabilities provided 
by NumPy alone, consider writing code in C, Fortran, or Cython. I use Cython fre¬ 
quently in my own work as an easy way to get C-like performance with minimal 
development. 

The Importance of Contiguous Memory 

While the full extent of this topic is a bit outside the scope of this book, in some 
applications the memory layout of an array can significantly affect the speed of com¬ 
putations. This is based partly on performance differences having to do with the 
cache hierarchy of the CPU; operations accessing contiguous blocks of memory (e.g., 
summing the rows of a C order array) will generally be the fastest because the mem¬ 
ory subsystem will buffer the appropriate blocks of memory into the ultrafast LI or 
L2 CPU cache. Also, certain code paths inside NumPy’s C codebase have been opti¬ 
mized for the contiguous case in which generic strided memory access can be 
avoided. 


480 | Appendix A: Advanced NumPy 



To say that an array’s memory layout is contiguous means that the elements are stored 
in memory in the order that they appear in the array with respect to Fortran (column 
major) or C (row major) ordering. By default, NumPy arrays are created as C- 
contiguous or just simply contiguous. A column major array, such as the transpose of 
a C-contiguous array, is thus said to be Fortran-contiguous. These properties can be 
explicitly checked via the flags attribute on the ndarray: 

In [225]: arr_c = np.ones((1000, 1000), order='C') 

In [226]: arr_f = np.ones((1000, 1000), order='F') 

In [227]: arr_c.flags 
0ut[227] : 

C_CONTIGUOUS : True 
F_CONTIGUOUS : False 
OWNDATA : True 
WRITEABLE : True 
ALIGNED : True 
UPDATEIFCOPY : False 

In [228]: arr_f.flags 
0ut[228] : 

C_CONTIGUOUS : False 
F_CONTIGUOUS : True 
OWNDATA : True 
WRITEABLE : True 
ALIGNED : True 
UPDATEIFCOPY : False 

In [229]: arr_f.flags.f_contlguous 
0ut[229] : True 

In this example, summing the rows of these arrays should, in theory, be faster for 
arr_c than arr_f since the rows are contiguous in memory. Here I check for sure 
using %timeit in IPython: 

In [230]: %tlneit arr_c.sum(l) 

784 us +- 10.4 us per loop (mean +- std. dev. of 7 runs, 1000 loops each) 

In [231]: %tlmeit arr_f.sum(l) 

934 us +- 29 us per loop (mean +- std. dev. of 7 runs, 1000 loops each) 

When you’re looking to squeeze more performance out of NumPy, this is often a 
place to invest some effort. If you have an array that does not have the desired mem¬ 
ory order, you can use copy and pass either ' C 1 or ' F': 

In [232]: arr_f.copy( 'C' ) .flags 
0ut[232] : 

C_CONTIGUOUS : True 
F_CONTIGUOUS : False 
OWNDATA : True 
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WRITEABLE : True 
ALIGNED : True 
UPDATEIFCOPY : False 

When constructing a view on an array, keep in mind that the result is not guaranteed 
to be contiguous: 

In [233]: arr_c[ : 50]. flags.contiguous 
0ut[233] : True 

In [234]: arr_c[:, :50], flags 
0ut[234]: 

C_CONTIGUOUS : False 
F_CONTIGUOUS : False 
OWNDATA : False 
WRITEABLE : True 
ALIGNED : True 
UPDATEIFCOPY : False 
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APPENDIX B 


More on the I Python System 


In Chapter 2 we looked at the basics of using the IPython shell and Jupyter notebook. 
In this chapter, we explore some deeper functionality in the IPython system that can 
either be used from the console or within Jupyter. 

B.1 Using the Command History 

IPython maintains a small on-disk database containing the text of each command 
that you execute. This serves various purposes: 

• Searching, completing, and executing previously executed commands with mini¬ 
mal typing 

• Persisting the command history between sessions 

• Logging the input/output history to a file 

These features are more useful in the shell than in the notebook, since the notebook 
by design keeps a log of the input and output in each code cell. 

Searching and Reusing the Command History 

The IPython shell lets you search and execute previous code or other commands. 
This is useful, as you may often find yourself repeating the same commands, such as a 
%run command or some other code snippet. Suppose you had run: 

In[7]: %run first/second/third/data_script.py 

and then explored the results of the script (assuming it ran successfully) only to find 
that you made an incorrect calculation. After figuring out the problem and modifying 
data_script.py, you can start typing a few letters of the %run command and then press 
either the Ctrl-P key combination or the up arrow key. This will search the command 
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history for the first prior command matching the letters you typed. Pressing either 
Ctrl-P or the up arrow key multiple times will continue to search through the history. 
If you pass over the command you wish to execute, fear not. You can move forward 
through the command history by pressing either Ctrl-N or the down arrow key. After 
doing this a few times, you may start pressing these keys without thinking! 

Using Ctrl-R gives you the same partial incremental searching capability provided by 
the readline used in Unix-style shells, such as the bash shell. On Windows, readline 
functionality is emulated by IPython. To use this, press Ctrl-R and then type a few 
characters contained in the input line you want to search for: 

In [1]: a_command = foo(x, y, z) 

(reverse-i-search)'com': a_command = foo(x, y, z) 

Pressing Ctrl-R will cycle through the history for each line matching the characters 
you’ve typed. 

Input and Output Variables 

Forgetting to assign the result of a function call to a variable can be very annoying. 
An IPython session stores references to both the input commands and output Python 
objects in special variables. The previous two outputs are stored in the _ (one under¬ 
score) and_(two underscores) variables, respectively: 

In [24]: 2 ** 27 
0ut[24] : 134217728 

In [25]: _ 

0ut[25] : 134217728 

Input variables are stored in variables named like _iX, where X is the input line num¬ 
ber. For each input variable there is a corresponding output variable _X. So after input 
line 27, say, there will be two new variables _27 (for the output) and _i27 for the 
input: 

In [26]: foo = 'bar' 

In [27]: foo 
0ut[27]: 'bar' 

In [28]: _L27 
0ut[28]: u'foo' 

In [29]: _27 
0ut[29]: 'bar' 
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Since the input variables are strings they can be executed again with the Python exec 
keyword: 

In [30]: exec(_127) 

Here _i27 refers to the code input in In [27]. 

Several magic functions allow you to work with the input and output history. %hist is 
capable of printing all or part of the input history, with or without line numbers. 
%reset is for clearing the interactive namespace and optionally the input and output 
caches. The %xdel magic function is intended for removing all references to a. particu¬ 
lar object from the IPython machinery. See the documentation for both of these mag¬ 
ics for more details. 



When working with very large datasets, keep in mind that IPy- 
thon’s input and output history causes any object referenced there 
to not be garbage-collected (freeing up the memory), even if you 
delete the variables from the interactive namespace using the del 
keyword. In such cases, careful usage of %xdel and %reset can help 
you avoid running into memory problems. 


B.2 Interacting with the Operating System 

Another feature of IPython is that it allows you to seamlessly access the filesystem 
and operating system shell. This means, among other things, that you can perform 
most standard command-line actions as you would in the Windows or Unix (Linux, 
macOS) shell without having to exit IPython. This includes shell commands, chang¬ 
ing directories, and storing the results of a command in a Python object (list or 
string). There are also simple command aliasing and directory bookmarking features. 

See Table B-l for a summary of magic functions and syntax for calling shell com¬ 
mands. I’ll briefly visit these features in the next few sections. 


Table B-l. IPython system-related commands 


Command Description 


lend 

output = lend args 

%alias alias_nane end 

%bookmark 

%cd directory 

%pwd 

%pushd directory 

%popd 

%dlrs 


Execute cmd in the system shell 

Run cmd and store the stdout in output 

Define an alias for a system (shell) command 

Utilize IPython's directory bookmarking system 

Change system working directory to passed directory 

Return the current system working directory 

Place current directory on stack and change to target directory 

Change to directory popped off the top of the stack 

Return a list containing the current directory stack 
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1 Command 

Description | 

%dhist 

Print the history of visited directories 

%env 

Return the system environment variables as a diet 

%matplotlib 

Configure matplotlib integration options 


Shell Commands and Aliases 

Starting a line in IPython with an exclamation point !, or bang, tells IPython to exe¬ 
cute everything after the bang in the system shell. This means that you can delete files 
(using rm or del, depending on your OS), change directories, or execute any other 
process. 

You can store the console output of a shell command in a variable by assigning the 
expression escaped with ! to a variable. For example, on my Linux-based machine 
connected to the internet via ethernet, I can get my IP address as a Python variable: 

In [1]: ip_tnfo = ilfconflg wlan0 | grep "Inet " 

In [2]: ip_info[0].strip( ) 

0ut[2] : 'inet addr:10.0.0.11 Bcast:10.0.0.255 Mask:255.255.255.0' 

The returned Python object ip_info is actually a custom list type containing various 
versions of the console output. 

IPython can also substitute in Python values defined in the current environment 
when using !. To do this, preface the variable name by the dollar sign $: 

In [3]: foo = 'test*' 

In [4]: ! Is $foo 
test4.py test.py test.xml 

The %alias magic function can define custom shortcuts for shell commands. As a 
simple example: 

In [1]: Dallas ll Is -l 

In [2]: ll /usr 
total 332 


drwxr-xr-x 

2 

root 

root 

69632 

2012-01-29 

20:36 

bln/ 

drwxr-xr-x 

2 

root 

root 

4096 

2010-08-23 

12:05 

games/ 

drwxr-xr-x 

123 

root 

root 

20480 

2011-12-26 

18:08 

Include/ 

drwxr-xr-x 

265 

root 

root 

126976 

2012-01-29 

20:36 

lib/ 

drwxr-xr-x 

44 

root 

root 

69632 

2011-12-26 

18:08 

llb32/ 

trwxrwxrwx 

1 

root 

root 

3 

2010-08-23 

16:02 

llb64 -> lib/ 

drwxr-xr-x 

15 

root 

root 

4096 

2011-10-13 

19:03 

local/ 

drwxr-xr-x 

2 

root 

root 

12288 

2012-01-12 

09:32 

sbln/ 

drwxr-xr-x 

387 

root 

root 

12288 

2011-11-04 

22:53 

share/ 

drwxrwsr-x 

24 

root 

sre 

4096 

2011-07-17 

18:38 

sre/ 
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You can execute multiple commands just as on the command line by separating them 
with semicolons: 

In [558]: %alias test_allas (cd examples; Is; cd ..) 

In [559]: test_allas 
macrodata.csv spx.csv tlps.csv 

You’ll notice that IPython “forgets” any aliases you define interactively as soon as the 
session is closed. To create permanent aliases, you will need to use the configuration 
system. 

Directory Bookmark System 

IPython has a simple directory bookmarking system to enable you to save aliases for 
common directories so that you can jump around very easily. For example, suppose 
you wanted to create a bookmark that points to the supplementary materials for this 
book: 

In [6]: %bookmark py4da /home/wesm/code/pydata-book 

Once you’ve done this, when we use the %cd magic, we can use any bookmarks we’ve 
defined: 

In [7]: cd py4da 

(bookmark:py4da) -> /home/wesm/code/pydata-book 
/home/wesm/code/pydata-book 

If a bookmark name conflicts with a directory name in your current working direc¬ 
tory, you can use the -b flag to override and use the bookmark location. Using the -1 
option with %bookmark lists all of your bookmarks: 

In [8]: ^bookmark -l 
Current bookmarks: 

py4da -> /home/wesm/code/pydata-book-source 
Bookmarks, unlike aliases, are automatically persisted between IPython sessions. 

B.3 Software Development Tools 

In addition to being a comfortable environment for interactive computing and data 
exploration, IPython can also be a useful companion for general Python software 
development. In data analysis applications, it’s important first to have correct code. 
Fortunately, IPython has closely integrated and enhanced the built-in Python pdb 
debugger. Secondly you want your code to be fast. For this IPython has easy-to-use 
code timing and profiling tools. I will give an overview of these tools in detail here. 
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Interactive Debugger 

IPython’s debugger enhances pdb with tab completion, syntax highlighting, and con¬ 
text for each line in exception tracebacks. One of the best times to debug code is right 
after an error has occurred. The %debug command, when entered immediately after 
an exception, invokes the “post-mortem” debugger and drops you into the stack 
frame where the exception was raised: 

In [2]: run examples/ipython_bug.py 


AssertionError Traceback (most recent call last) 

/hopie/wespi/code/pydata-book/examples/ipython_bug.py in <module>() 

13 throws_an_exception( ) 

14 

—> 15 calling_things( ) 

/home/wespi/code/pydata-book/exariples/ipython_bug.py in calling_things( ) 

11 def calling_things() : 

12 works_fine() 

—> 13 throws_an_exception() 

14 

15 calling_things( ) 

/hopie/wespi/code/pydata-book/examples/ipython_bug. py in throws_an_exception() 

7 a = 5 

8 b = 6 

-> 9 assert(a + b == 10) 

10 

11 def calling_things() : 

AssertionError: 

In [3]: %debug 

> /home/wesm/code/pydata-book/examples/ipython_bug.py(9)throws_an_exception( ) 
8 b = 6 

-> 9 assert(a + b == 10) 

10 


ipdb> 

Once inside the debugger, you can execute arbitrary Python code and explore all of 
the objects and data (which have been “kept alive” by the interpreter) inside each 
stack frame. By default you start in the lowest level, where the error occurred. By 
pressing u (up) and d (down), you can switch between the levels of the stack trace: 

ipdb> u 

> /home/wesm/code/pydata-book/examples/ipython_bug.py(13)calling_things( ) 

12 works_fine() 

—> 13 throws_an_exception() 

14 
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Executing the %pdb command makes it so that IPython automatically invokes the 
debugger after any exception, a mode that many users will find especially useful. 

It’s also easy to use the debugger to help develop code, especially when you wish to set 
breakpoints or step through the execution of a function or script to examine the state 
at each stage. There are several ways to accomplish this. The first is by using %run 
with the - d flag, which invokes the debugger before executing any code in the passed 
script. You must immediately press s (step) to enter the script: 

In [5]: run -d examples/lpython_bug.py 

Breakpoint 1 at /home/wesm/code/pydata-book/examples/lpython_bug.py : 1 
NOTE: Enter 'c' at the tpdb> prompt to start your script. 

> <string>(l)<module>() 

ipdb> s 
--Call-- 

> /home/wesm/code/pydata-book/examples/ipython_bug.py(l)<module>( ) 

1 — > 1 def works_fine() : 

2 a = 5 

3 b = 6 

After this point, it’s up to you how you want to work your way through the file. For 
example, in the preceding exception, we could set a breakpoint right before calling 
the works_fine method and run the script until we reach the breakpoint by pressing 
c (continue): 

ipdb> b 12 
ipdb> c 

> /home/wesm/code/pydata-book/examples/ipython_bug.py(12)catling_things( ) 

11 def calling_things( ): 

2--> 12 works_fine() 

13 throws_an_exceptlon( ) 

At this point, you can step into works_fine() or execute works_f lne() by pressing n 
(next) to advance to the next line: 

ipdb> n 

> /home/wesm/code/pydata-book/examples/ipython_bug.py(13)calling_things( ) 

2 12 works_fine() 

—> 13 throws_an_exception() 

14 

Then, we could step into throws_an_exceptton and advance to the line where the 
error occurs and look at the variables in the scope. Note that debugger commands 
take precedence over variable names; in such cases, preface the variables with ! to 
examine their contents: 

ipdb> s 
- -Call-- 

> /home/wesm/code/pydata-book/examples/ipython_bug.py(6)throws_an_exceptlon( ) 

5 
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-> 6 def throws_an_exception(): 

7 a = 5 

ipdb> n 

> /home/wesm/code/pydata-book/examples/ipython_bug. py(7)throws_an_excepti.on( ) 

6 def throws_an_exceptlon( ): 

----> 7 a = 5 

8 b = 6 

lpdb> n 

> /home/wesm/code/pydata-book/examples/ipython_bug. py(8)throws_an_excepti.on( ) 

7 a = 5 

----> 8 b = 6 

9 assert(a + b == 10) 

lpdb> n 

> /home/wesm/code/pydata-book/examples/ipython_bug.py(9)throws_an_exceptlon( ) 

8 b = 6 

-> 9 assert(a + b == 10) 

10 

tpdb> !a 

5 

tpdb> !b 

6 

Developing proficiency with the interactive debugger is largely a matter of practice 
and experience. See Table B-2 for a full catalog of the debugger commands. If you are 
accustomed to using an IDE, you might find the terminal-driven debugger to be a bit 
unforgiving at first, but that will improve in time. Some of the Python IDEs have 
excellent GUI debuggers, so most users can find something that works for them. 


Table B-2. (I)Python debugger commands 


1 Command 

Action 1 

h(elp) 

Display command list 

help command 

Show documentation for command 

c(ontlnue) 

Resume program execution 

q(uit) 

Exit debugger without executing any more code 

b(reak) number 

Set breakpoint at number in current file 

b path/to/file.py:number 

Set breakpoint at line number in specified file 

s(tep) 

Step into function call 

n(ext) 

Execute current line and advance to next line at current level 

u(p)/d(own) 

Move up/down in function call stack 

a(rgs) 

Show arguments for current function 

debug statement 

Invoke statement statement in new (recursive) debugger 

l(ist) statement 

Show current position and context at current level of stack 

w(here) 

Print full stack trace with context at current position 
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Other ways to make use of the debugger 

There are a couple of other useful ways to invoke the debugger. The first is by using a 
special set_trace function (named after pdb.set_trace), which is basically a “poor 
man’s breakpoint.” Here are two small recipes you might want to put somewhere for 
your general use (potentially adding them to your IPython profile as I do): 

from import Pdb 

def set_trace() : 

Pdb(color_scheme= 1 Linux' ). set_trace(sys._getf rame(). f_back) 

def debug(f, *args, **kwargs): 

pdb = Pdb(color_scheme=' Linux' ) 
return pdb.runcall(f , *args, **kwargs) 

The first function, set_trace, is very simple. You can use a set_trace in any part of 
your code that you want to temporarily stop in order to more closely examine it (e.g., 
right before an exception occurs): 

In [7]: run examples/ipython_bug.py 

> /home/wesm/code/pydata-book/examples/ipython_bug.py(16)catling_things( ) 

15 set_trace() 

—> 16 throws_an_exception( ) 

17 

Pressing c (continue) will cause the code to resume normally with no harm done. 

The debug function we just looked at enables you to invoke the interactive debugger 
easily on an arbitrary function call. Suppose we had written a function like the fol¬ 
lowing and we wished to step through its logic: 

def f(x, y, z=l): 
tmp = x + y 

return tmp / z 

Ordinarily using f would look like f (1 , 2 , z=3) . To instead step into f, pass f as the 
first argument to debug followed by the positional and keyword arguments to be 
passed to f: 

In [6]: debug(f, 1, 2, z=3) 

> <tpython-input>(2)f() 

1 def f (x, y , z) : 

- - --> 2 tmp = x + y 

3 return tmp / z 


tpdb> 

I find that these two simple recipes save me a lot of time on a day-to-day basis. 
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Lastly, the debugger can be used in conjunction with %run. By running a script with 
%run -d, you will be dropped directly into the debugger, ready to set any breakpoints 
and start the script: 

In [1]: %run -d examples/ipython_bug.py 

Breakpoint 1 at /home/wesm/code/pydata-book/examples/ipython_bug.py : 1 
NOTE: Enter 'c' at the tpdb> prompt to start your script. 

> <string>(l)<module>() 

ipdb> 

Adding - b with a line number starts the debugger with a breakpoint set already: 

In [2]: %run -d -b2 examples/ipython_bug.py 

Breakpoint 1 at /home/wesm/code/pydata-book/examples/ipython_bug.py :2 
NOTE: Enter 'c' at the ipdb> prompt to start your script. 

> <string>(l)<module>() 

ipdb> c 

> /home/wesm/code/pydata-book/examples/ipython_bug.py(2)works_fine( ) 

1 def works_fine() : 
l---> 2 a = 5 

3 b = 6 


ipdb> 

Timing Code: %time and %timeit 

For larger-scale or longer-running data analysis applications, you may wish to meas¬ 
ure the execution time of various components or of individual statements or function 
calls. You may want a report of which functions are taking up the most time in a com¬ 
plex process. Fortunately, IPython enables you to get this information very easily 
while you are developing and testing your code. 

Timing code by hand using the built-in tine module and its functions time-clock 
and tine. tine is often tedious and repetitive, as you must write the same uninterest¬ 
ing boilerplate code: 

import time 

start = time.time() 

for i in range(iterations) : 

# solve code to run here 

elapsed_per = (time.timeQ - start) / iterations 

Since this is such a common operation, IPython has two magic functions, %tine and 
%timeit, to automate this process for you. 

%time runs a statement once, reporting the total execution time. Suppose we had a 
large list of strings and we wanted to compare different methods of selecting all 
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strings starting with a particular prefix. Here is a simple list of 600,000 strings and 
two identical methods of selecting only the ones that start with ' foo 1 : 

# a very large list of strings 

strings = ['foo', 'foobar', 'baz', 'qux', 

'python', 'Guido Van Rossum'] * 100000 

methodl = [x for x in strings if x.startswith( 'foo' )] 

method2 = [x for x in strings if x[:3] == ’foo 1 ] 

It looks like they should be about the same performance-wise, right? We can check 
for sure using %tine: 

In [561]: %time methodl = [x for x in strings if x.startswith(' foo' )] 

CPU tines: user 0.19 s, sys: 0.00 s, total: 0.19 s 
Wall tine: 0.19 s 

In [562]: %tine nethod2 = [x for x in strings if x[:3] == 'foo'] 

CPU tines: user 0.09 s, sys: 0.00 s, total: 0.09 s 
Wall tine: 0.09 s 

The Wall time (short for “wall-clock time”) is the main number of interest. So, it 
looks like the first method takes more than twice as long, but it’s not a very precise 
measurement. If you try %time-ing those statements multiple times yourself, you’ll 
find that the results are somewhat variable. To get a more precise measurement, use 
the %timeit magic function. Given an arbitrary statement, it has a heuristic to run a 
statement multiple times to produce a more accurate average runtime: 

In [563]: %tineit [x for x in strings if x.startswith(' foo' )] 

10 loops, best of 3: 159 ns per loop 

In [564]: %tineit [x for x in strings if x[:3] == 'foo'] 

10 loops, best of 3: 59.3 ns per loop 

This seemingly innocuous example illustrates that it is worth understanding the per¬ 
formance characteristics of the Python standard library, NumPy, pandas, and other 
libraries used in this book. In larger-scale data analysis applications, those milli¬ 
seconds will start to add up! 

%timeit is especially useful for analyzing statements and functions with very short 
execution times, even at the level of microseconds (millionths of a second) or nano¬ 
seconds (billionths of a second). These may seem like insignificant amounts of time, 
but of course a 20 microsecond function invoked 1 million times takes 15 seconds 
longer than a 5 microsecond function. In the preceding example, we could very 
directly compare the two string operations to understand their performance 
characteristics: 

In [565]: x = 'foobar' 
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In [566] : y = 'foo 


In [567]: %timeit x.startswi.th(y) 

1000000 loops, best of 3: 267 ns per loop 

In [568]: %timeit x[:3] == y 

10000000 loops, best of 3: 147 ns per loop 

Basic Profiling: %prun and %run -p 

Profiling code is closely related to timing code, except it is concerned with determin¬ 
ing where time is spent. The main Python profiling tool is the cProfile module, 
which is not specific to IPython at all. cProfile executes a program or any arbitrary 
block of code while keeping track of how much time is spent in each function. 

A common way to use cProfile is on the command line, running an entire program 
and outputting the aggregated time per function. Suppose we had a simple script that 
does some linear algebra in a loop (computing the maximum absolute eigenvalues of 
a series of 100 x 100 matrices): 

import rumpy as np 

from import eigvals 

def run_experiment(niter=100) : 

K = 100 

results = [] 

for _ in xrange(niter) : 

mat = np.random.randn(K, K) 
max_eigenvalue = np.abs(eigvals(mat)) .max() 
results.append(max_eigenvalue) 
return results 

some_results = run_experiment() 

print 'Largest one we saw: %s' % np.max(some_results) 

You can run this script through cProfile using the following in the command line: 

python -m cProfile cprof_example.py 

If you try that, you’ll find that the output is sorted by function name. This makes it a 
bit hard to get an idea of where the most time is spent, so it’s very common to specify 
a sort order using the - s flag: 

$ python -m cProfile -s cumulative cprof_example.py 
Largest one we saw: 11.923204422 

15116 function calls (14927 primitive calls) in 0.720 seconds 

Ordered by: cumulative time 

ncalls tottime percall cumtime percall filename:lineno(function) 

1 0.001 0.001 0.721 0.721 cprof_example.py:l(<module>) 

100 0.003 0.000 0.586 0.006 linalg.py : 702 (eigvals) 
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200 

0.572 

0.003 

0.572 

0.003 

{numpy.linalg.lapack_lite.dgeev} 

1 

0.002 

0.002 

0.075 

0.075 

_init _ . py : 106(<module>) 

100 

0.059 

0.001 

0.059 

0.001 

{method 'randn') 

1 

0.000 

0.000 

0.044 

0.044 

add_newdocs.py : 9(<module>) 

2 

0.001 

0.001 

0.037 

0.019 

_init _ . py : l(<module>) 

2 

0.003 

0.002 

0.030 

0.015 

_ init _ .py:2(<module>) 

1 

0.000 

0.000 

0.030 

0.030 

type_check.py:3(<module>) 

1 

0.001 

0.001 

0.021 

0.021 

_ init _ . py : 15(<module>) 

1 

0.013 

0.013 

0.013 

0.013 

numeric.py : l(<module>) 

1 

0.000 

0.000 

0.009 

0.009 

_ init _ .py:6(<module>) 

1 

0.001 

0.001 

0.008 

0.008 

_ init _ . py:45(<module>) 

262 

0.005 

0.000 

0.007 

0.000 

function_base.py : 3178(add_newdoc) 

100 

0.003 

0.000 

0.005 

0.000 

linalg .py: 162 (_assertFinite) 


Only the first 15 rows of the output are shown. It’s easiest to read by scanning down 
the cumtime column to see how much total time was spent inside each function. Note 
that if a function calls some other function, the clock does not stop running. cProfile 
records the start and end time of each function call and uses that to produce the 
timing. 

In addition to the command-line usage, cProfile can also be used programmatically 
to profile arbitrary blocks of code without having to run a new process. IPython has a 
convenient interface to this capability using the %prun command and the -p option to 
%run. %prun takes the same “command-line options” as cProfile but will profile an 
arbitrary Python statement instead of a whole .py file: 

In [4]: %prun -l 7 -s cumulative run_experiment( ) 

4203 function calls in 0.643 seconds 


Ordered by: cumulative time 

List reduced from 32 to 7 due to restriction <7> 


ncalls 

tottime 

percall 

cumtime 

percall filename:lineno(function) 

1 

0.000 

0.000 

0.643 

0.643 <string>:l(<module>) 

1 

0.001 

0.001 

0.643 

0.643 cprof_example.py : 4( run_experiment ) 

100 

0.003 

0.000 

0.583 

0.006 linalg.py:702(eigvals) 

200 

0.569 

0.003 

0.569 

0.003 {numpy.linalg.lapack_lite.dgeev} 

100 

0.058 

0.001 

0.058 

0.001 {method 'randn'} 

100 

0.003 

0.000 

0.005 

0.000 linalg.py : 162 (_assertFinite) 

200 

0.002 

0.000 

0.002 

0.000 {method 'all' of ' numpy.ndarray' } 


Similarly, calling %run -p -s cumulative cprof_example.py has the same effect as 
the command-line approach, except you never have to leave IPython. 

In the Jupyter notebook, you can use the %%prun magic (two % signs) to profile an 
entire code block. This pops up a separate window with the profile output. This can 
be useful in getting possibly quick answers to questions like, “Why did that code 
block take so long to run?” 
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There are other tools available that help make profiles easier to understand when you 
are using IPython or Jupyter. One of these is SnakeViz, which produces an interactive 
visualization of the profile results using d3.js. 

Profiling a Function Line by Line 

In some cases the information you obtain from %prun (or another cProfile-based 
profile method) may not tell the whole story about a functions execution time, or it 
may be so complex that the results, aggregated by function name, are hard to inter¬ 
pret. For this case, there is a small library called line_profiler (obtainable via PyPI 
or one of the package management tools). It contains an IPython extension enabling 
a new magic function %lprun that computes a line-by-line-profiling of one or more 
functions. You can enable this extension by modifying your IPython configuration 
(see the IPython documentation or the section on configuration later in this chapter) 
to include the following line: 

# A list of dotted nodule names of IPython extensions to load. 
c.TeminallPythonApp.extensions = [ 'line_profiler 1 ] 

You can also run the command: 

%load_ext llne_profiler 

line_profiler can be used programmatically (see the full documentation), but it is 
perhaps most powerful when used interactively in IPython. Suppose you had a mod¬ 
ule prof_mod with the following code doing some NumPy array operations: 

from import randn 

def add_and_sum(x, y): 
added = x + y 

summed = added.sum(axts=l) 
return summed 

def calt_function( ): 

x = randn(1000, 1000) 
y = randn(1000, 1000) 
return add_and_sum(x, y) 

If we wanted to understand the performance of the add_and_sum function, %prun 
gives us the following: 

In [569]: %run prof_mod 

In [570]: x = randn(3000, 3000) 

In [571]: y = randn(3000, 3000) 

In [572]: %prun add_and_sum(x, y) 

4 function calls in 0.049 seconds 
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Ordered by: internal tine 


ncalls 

tottine 

percall 

cumtime 

percall filename:lineno(function) 

1 

0.036 

0.036 

0.046 

0.046 profjnod.py:3(add_and_sum) 

1 

0.009 

0.009 

0.009 

0.009 {method 'sum' of 'numpy.ndai 

1 

0.003 

0.003 

0.049 

0.049 <string>:l(<module>) 


This is not especially enlightening. With the line_profiler IPython extension acti¬ 
vated, a new command %lprun is available. The only difference in usage is that we 
must instruct %lprun which function or functions we wish to profile. The general 
syntax is: 

%lprun -f fund -f func2 statement_to_profile 

In this case, we want to profile add_and_sum, so we run: 

In [573]: %lprun -f add_and_sum add_and_sum(x, y) 

Timer unit: le-06 s 
File: profjnod.py 
Function: add_and_sum at line 3 
Total tine: 0.045936 s 


Line # 

Hits 

Time 

Per Hit 

% Tine 

Line Contents 

3 





def add_and_sum(x, y): 

4 

1 

36510 

36510.0 

79.5 

added = x + y 

5 

1 

9425 

9425.0 

20.5 

summed = added.sum(axis=l) 

6 

1 

1 

1.0 

0.0 

return summed 


This can be much easier to interpret. In this case we profiled the same function we 
used in the statement. Looking at the preceding module code, we could call 
call_function and profile that as well as add_and_sum, thus getting a full picture of 
the performance of the code: 

In [574]: %lprun -f add_and_sum -f call_function call_function( ) 

Timer unit: le-06 s 
File: profjnod.py 


Function : 

add and sum at 

: line 

3 



Total time 

: 0.005526 s 





Line # 

Hits 

Time 

Per Hit 

% Time 

Line Contents 

3 





def add_and_sum(x, y): 

4 

1 

4375 

4375.0 

79.2 

added = x + y 

5 

1 

1149 

1149.0 

20.8 

summed = added.sum(axis=l) 

6 

1 

2 

2.0 

0.0 

return summed 

File: profjnod.py 





Function : 

call_function 

at line 8 



Total time 

: 0.121016 s 





Line # 

Hi ts 

Time 

Per Hit 

% Time 

Line Contents 

8 





def call_function( ): 

9 

1 

57169 

57169.0 

47.2 

x = randn(1000, 1000) 

10 

1 

58304 

58304.0 

48.2 

y = randn(1000, 1000) 

11 

1 

5543 

5543.0 

4.6 

return add_and_sum(x, y) 


More on the IPython System | 497 









As a general rule of thumb, I tend to prefer %prun (cProfile) for “macro” profiling 
and %lprun (line_profiler) for “micro” profiling. It’s worthwhile to have a good 
understanding of both tools. 



The reason that you must explicitly specify the names of the func¬ 
tions you want to profile with %lprun is that the overhead of “trac¬ 
ing” the execution time of each line is substantial. Tracing 
functions that are not of interest has the potential to significantly 
alter the profile results. 


B.4 Tips for Productive Code Development Using IPython 

Writing code in a way that makes it easy to develop, debug, and ultimately use inter¬ 
actively may be a paradigm shift for many users. There are procedural details like 
code reloading that may require some adjustment as well as coding style concerns. 

Therefore, implementing most of the strategies described in this section is more of an 
art than a science and will require some experimentation on your part to determine a 
way to write your Python code that is effective for you. Ultimately you want to struc¬ 
ture your code in a way that makes it easy to use iteratively and to be able to explore 
the results of running a program or function as effortlessly as possible. I have found 
software designed with IPython in mind to be easier to work with than code intended 
only to be run as as standalone command-line application. This becomes especially 
important when something goes wrong and you have to diagnose an error in code 
that you or someone else might have written months or years beforehand. 

Reloading Module Dependencies 

In Python, when you type import some_lib, the code in somejlib is executed and all 
the variables, functions, and imports defined within are stored in the newly created 
some_lib module namespace. The next time you type import somejlib, you will get 
a reference to the existing module namespace. The potential difficulty in interactive 
IPython code development comes when you, say, %run a script that depends on some 
other module where you may have made changes. Suppose I had the following code 
in test_script.py: 

import some_lib 

x = 5 

y = [1, 2, 3, 4] 

result = some_lib.get_answer(x, y) 

If you were to execute %run test_script. py then modify some_lib.py, the next time 
you execute %run test_script.py you will still get the old version of some_lib.py 
because of Python’s “load-once” module system. This behavior differs from some 
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other data analysis environments, like MATLAB, which automatically propagate code 
changes. 1 To cope with this, you have a couple of options. The first way is to use the 
reload function in the importlib module in the standard library: 

import some_lib 
import importlib 

importlib.reload(some_lib) 

This guarantees that you will get a fresh copy of some_lib.py every time you run 
test_script.py. Obviously, if the dependencies go deeper, it might be a bit tricky to be 
inserting usages of reload all over the place. For this problem, IPython has a special 
d reload function ( not a magic function) for “deep” (recursive) reloading of modules. 
If I were to run some_lib.py then type dreload(some_lib), it will attempt to reload 
some_lib as well as all of its dependencies. This will not work in all cases, unfortu¬ 
nately, but when it does it beats having to restart IPython. 

Code Design Tips 

There’s no simple recipe for this, but here are some high-level principles I have found 
effective in my own work. 

Keep relevant objects and data alive 

It’s not unusual to see a program written for the command line with a structure some¬ 
what like the following trivial example: 

from import g 

def f (x, y): 

return g(x + y) 

def main() : 
x = 6 
y = 7.5 

result = x + y 

if _name _ == ' _ main _ ' : 

main( ) 

Do you see what might go wrong if we were to run this program in IPython? After it’s 
done, none of the results or objects defined in the main function will be accessible in 
the IPython shell. A better way is to have whatever code is in main execute directly in 
the module’s global namespace (or in the if_name_== '_main_': block, if you 


1 Since a module or package may be imported in many different places in a particular program, Python caches a 
modules code the first time it is imported rather than executing the code in the module every time. Other¬ 
wise, modularity and good code organization could potentially cause inefficiency in an application. 
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want the module to also be importable). That way, when you %run the code, you’ll be 
able to look at all of the variables defined in main. This is equivalent to defining top- 
level variables in cells in the Jupyter notebook. 

Flat is better than nested 

Deeply nested code makes me think about the many layers of an onion. When testing 
or debugging a function, how many layers of the onion must you peel back in order 
to reach the code of interest? The idea that “flat is better than nested” is a part of the 
Zen of Python, and it applies generally to developing code for interactive use as well. 
Making functions and classes as decoupled and modular as possible makes them eas¬ 
ier to test (if you are writing unit tests), debug, and use interactively. 

Overcome a fear of longer files 

If you come from a Java (or another such language) background, you may have been 
told to keep files short. In many languages, this is sound advice; long length is usually 
a bad “code smell,” indicating refactoring or reorganization may be necessary. How¬ 
ever, while developing code using IPython, working with 10 small but interconnected 
files (under, say, 100 lines each) is likely to cause you more headaches in general than 
two or three longer files. Fewer files means fewer modules to reload and less jumping 
between files while editing, too. I have found maintaining larger modules, each with 
high internal cohesion, to be much more useful and Pythonic. After iterating toward 
a solution, it sometimes will make sense to refactor larger files into smaller ones. 

Obviously, I don’t support taking this argument to the extreme, which would to be to 
put all of your code in a single monstrous file. Finding a sensible and intuitive mod¬ 
ule and package structure for a large codebase often takes a bit of work, but it is espe¬ 
cially important to get right in teams. Each module should be internally cohesive, and 
it should be as obvious as possible where to find functions and classes responsible for 
each area of functionality. 

B.5 Advanced IPython Features 

Making full use of the IPython system may lead you to write your code in a slightly 
different way, or to dig into the configuration. 

Making Your Own Classes IPython-Friendly 

IPython makes every effort to display a console-friendly string representation of any 
object that you inspect. For many objects, like diets, lists, and tuples, the built-in 
pprint module is used to do the nice formatting. In user-defined classes, however, 
you have to generate the desired string output yourself. Suppose we had the following 
simple class: 
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class Message: 

def _init_ (self, nsg): 

self.msg = msg 

If you wrote this, you would be disappointed to discover that the default output for 
your class isn’t very nice: 

In [576]: x = Message( 'I have a secret') 

In [577]: x 

0ut[577]: <_ main_.Message instance at 0x60ebbd8> 

IPython takes the string returned by the_repr_magic method (by doing output = 

repr(obj)) and prints that to the console. Thus, we can add a simple_repr_ 

method to the preceding class to get a more helpful output: 

class Message: 

def init (self, nsg): 

self. nsg = nsg 

def repr (self): 

return 'Message: %s' % self. nsg 

In [579]: x = Message( 'I have a secret') 

In [580]: x 

Out[580]: Message: I have a secret 

Profiles and Configuration 

Most aspects of the appearance (colors, prompt, spacing between lines, etc.) and 
behavior of the IPython and Jupyter environments are configurable through an 
extensive configuration system. Here are some things you can do via configuration: 

• Change the color scheme 

• Change how the input and output prompts look, or remove the blank line after 
Out and before the next In prompt 

• Execute an arbitrary list of Python statements (e.g., imports that you use all the 
time or anything else you want to happen each time you launch IPython) 

• Enable always-on IPython extensions, like the %lprun magic in line_profiler 

• Enabling Jupyter extensions 

• Define your own magics or system aliases 

Configurations for the IPython shell are specified in special ipython_config.py files, 
which are usually found in the .ipython/ directory in your user home directory. Con¬ 
figuration is performed based on a particular profile. When you start IPython nor¬ 
mally, you load up, by default, the default profile, stored in the pro file _default 
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directory. Thus, on my Linux OS the full path to my default IPython configuration 
file is: 

/home/wesm/.ipython/profile_default/lpython_conftg.py 

To initialize this file on your system, run in the terminal: 

Ipython profile create 

I’ll spare you the gory details of what’s in this file. Fortunately it has comments 
describing what each configuration option is for, so I will leave it to the reader to 
tinker and customize. One additional useful feature is that it’s possible to have multi¬ 
ple profiles. Suppose you wanted to have an alternative IPython configuration tailored 
for a particular application or project. Creating a new profile is as simple as typing 
something like the following: 

Ipython profile create secret_project 

Once you’ve done this, edit the config files in the newly created profile_secret_project 
directory and then launch IPython like so: 

$ Ipython --profile=secret_project 

Python 3.5.1 | packaged by conda-forge | (default. May 20 2016, 05:22:56) 

Type "copyright", "credits" or "license" for more information. 

IPython 5.1.0 -- An enhanced Interactive Python. 

? -> Introduction and overview of IPython's features. 

%qutckref -> Quick reference. 

help -> Python's own help system. 

object? -> Details about 'object', use 'object??' for extra details. 

IPython profile: secret_project 

As always, the online IPython documentation is an excellent resource for more on 
profiles and configuration. 

Configuration for Jupyter works a little differently because you can use its notebooks 
with languages other than Python. To create an analogous Jupyter config file, run: 

jupyter notebook --generate-config 

This writes a default config file to the .jupyter/jupyter_notebook_config.py directory in 
your home directory. After editing this to suit your needs, you may rename it to a 
different file, like: 

$ mv -/.jupyter/jupyter_notebook_config.py ~/.jupyter/my_custom_config.py 
When launching Jupyter, you can then add the - - config argument: 
jupyter notebook --config=~/.jupyter/my_custom_config.py 
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B.6 Conclusion 

As you work through the code examples in this book and grow your skills as a Python 
programmer, I encourage you to keep learning about the IPython and Jupyter ecosys¬ 
tems. Since these projects have been designed to assist user productivity, you may dis¬ 
cover tools that enable you to do your work more easily than using the Python 
language and its computational libraries by themselves. 

You can also find a wealth of interesting Jupyter notebooks on the nbviewer website. 
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for ndarrays, 89, 453, 463, 481 
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barh method, 272 
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Cartesian product, 77, 230 
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cat method, 218 
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%cd magic function, 485, 487 
ceil function, 107 
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clipboard, executing code from, 26 
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closed attribute, 83 
!cmd command, 485 
collections module, 64 
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columns method, 315 
column_stack function, 456 
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combining data (see merging data) 
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input and output variables, 484 
reusing, 483 
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using in IPython, 483-485 
commands 
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magic functions, 28-29 
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comments in Python, 31 
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complex256 data type, 91 
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concat function, 227, 235, 237-241, 300 
concatenate function, 236, 454 
concatenating 

along an axis, 227, 236-241 
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strings, 41 

conda update command, 10 
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copysign function, 107 
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corr method, 161 
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cov method, 161 

covariance, 160-162 

%cpaste magic function, 26, 29 

cProflle module, 494-496 

cross-tabulation, 315 

crosstab function, 316 

cross_val_score function, 401 

CSV files, 168, 175-178 

csv module, 176 

Ctrl-A keyboard shortcut, 27 

Ctrl-B keyboard shortcut, 27 

Ctrl-C keyboard shortcut, 26, 27 

Ctrl-D keyboard shortcut, 16 

Ctrl-E keyboard shortcut, 27 

Ctrl-F keyboard shortcut, 27 

Ctrl-K keyboard shortcut, 27 

Ctrl-L keyboard shortcut, 27 

Ctrl-N keyboard shortcut, 27, 484 

Ctrl-P keyboard shortcut, 27, 483 

Ctrl-R keyboard shortcut, 27, 484 

Ctrl-Shift-V keyboard shortcut, 27 

Ctrl-U keyboard shortcut, 27 

cummax method, 160 

cummin method, 160 

cumprod method, 112, 160 

cumsum method, 112, 160, 466 

curly braces {}, 61, 65 

currying, 74 

cut function, 203, 305 

c_ object, 456 
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D 

%d datetime format, 46, 319 
%D datetime format, 46, 320 
d(own) debugger command, 490 
data aggregation 
about, 296 

column-wise, 298-301 
multiple function application, 298-301 
returning data without row indexes, 301 
data alignment, pandas library and, 146-151 
data analysis with Python 
about, 2, 15-16 
glue code, 2 

MovieLens 1M dataset example, 413-419 
restrictions to consider, 3 
US baby names dataset example, 419-434 
US Federal Election Commission database 
example, 440-448 

USA.gov data from Bitly example, 403-413 
USDA food database example, 434-439 
“two-language” problem, 3 
data cleaning and preparation (see data wran¬ 
gling) 

data loading (see reading data) 
data manipulation (see data wrangling) 
data munging (see data wrangling) 
data selection 

for axis indexes with duplicate labels, 157 
in pandas library, 140-145 
time series data, 323 
data structures 
about, 51 

diet comprehensions, 67 
diets, 61-65 

for pandas library, 124-136 
list comprehensions, 67-69 
lists, 54-59 

set comprehensions, 68 
sets, 65-67 
tuples, 51-54 

data transformation (see transforming data) 
data types 

attributes for, 469 
defined, 90, 449 
for date and time data, 318 
for ndarrays, 90-93 
in Python, 38-46 
nested, 469 

NumPy hierarchy, 450 


parent classes of, 450 
data wrangling 

combining and merging datasets, 227-242 
defined, 14 

handling missing data, 191-197 
hierarchical indexing, 221-226, 243 
pivoting data, 246-250 
reshaping data, 243 
string manipulation, 211-219 
transforming data, 197-211 
working with delimited formats, 176-178 
databases 

DataFrame joins, 227-232 
pandas interacting with, 188 
storing data in, 247 
DataFrame data structure 
about, 4, 128-134, 470 
database-stye joins, 227-232 
indexing with columns, 225 
JSON data and, 180 
operations between Series and, 149 
optional function arguments, 168 
plot method arguments, 271 
possible data inputs to, 134 
ranking data in, 155 
sorting considerations, 153, 473 
summary statistics methods for, 161 
DataOffset object, 338 
datasets 

combining and merging, 227-242 
MovieLens 1M example, 413-419 
US baby names example, 419-434 
US Federal Election Commission database 
example, 440-448 

USA.gov data from Bitly example, 403-413 
USDA food database example, 434-439 
date data type, 44, 319 
date offsets, 330, 333-334 
date ranges, generating, 328-330 
dates and times 
about, 44 

converting between strings and datetime, 
319-321 

data types and tools, 318 
formatting specifications, 319, 321 
generating date ranges, 328-330 
period arithmetic and, 339-347 
datetime data type 
about, 44, 318-319 
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converting between strings and, 319-321 
format specification for, 319 
datetime module, 44, 318 
datetime64 data type, 322 
Datetimelndex class, 322, 328, 337 
dateutil package, 320 
date_range function, 328-330 
daylight saving time (DST), 335 
debug function, 491 
%debug magic function, 80, 488 
debugger, IPython, 488-492 
decode method, 42 
def keyword, 69, 74 
default values for diets, 63 
defaultdict class, 64 
del keyword, 62, 132 
del method, 132 
delete method, 136 

delimited formats, working with, 176-178 

dense method, 156 

density plots, 277-279 

deque (double-ended queue), 55 

describe method, 160, 297 

design matrix, 386 

det function, 117 

development tools for IPython (see software 
development tools for IPython) 

%dhist magic function, 486 
diag function, 117 
Dialect class, 177 
diet comprehensions, 67 
diet function, 63 

dictionary-encoded representation, 365 
diets (data structures) 
about, 61 

creating from sequences, 63 
DataFrame data structure as, 129 
default values, 63 
grouping with, 294 
Series data structure as, 125 
valid key types, 64 
diff method, 160 
difference method, 66, 136 
difference_update method, 66 
dimension tables, 364 
directories, bookmarking in IPython, 487 
%dirs magic function, 485 
discretization, 203 
distplot method, 279 


div method, 149 
divide function, 107 
divmod function, 106 
dmatrices function, 386 
dnorm function, 394 
dot function, 104, 116-117 
downsampling, 348, 349-351 
dreload function, 499 
drop method, 136, 138 
dropna method, 192-193, 306, 315 
drop_duplicates method, 197 
DST (daylight saving time), 335 
dstack function, 456 
dtype (see data types) 
dtype attribute, 88, 92 
duck typing, 35 

dummy variables, 208-211, 372, 386, 391 
dumps function, 179 
duplicate data 

axis indexes with duplicate labels, 157 

removing, 197 

time series with duplicate indexes, 326 
duplicated method, 197 
dynamic references in Python, 33 

E 

edit-compile-run workflow, 6 

education, continuing, 401 

eig function, 118 

elif statement, 46 

else statement, 46 

empty function, 89-90 

empty namespace, 25 

empty_like function, 90 

encode method, 42 

end-of-line (EOL) markers, 80 

endswith method, 213, 218 

enumerate function, 59 

%env magic function, 486 

EOL (end-of-line) markers, 80 

equal function, 108 

error handling in Python, 77-80 

escape characters, 41 

ewm function, 358 

Excel files (Microsoft), 186-187 

ExcelFile class, 186 

exception handling in Python, 77-80 

exclamation point (!), 486 

execute-explore workflow, 6 
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exit command, 16 

exp function, 107 

expanding function, 356 

exponentially-weighted functions, 358 

extend method, 56 

extract method, 218 

eye function, 90 

F 

%F datetime format, 46, 320 
fabs function, 107 
facet grids, 283 
FacetGrid class, 285 
factorplot built-in function, 283 
fancy indexing, 102, 459 
FDIC bank failures list, 180 
Feather binary file format, 168, 184 
feature engineering, 383 
Federal Election Commission database exam¬ 
ple, 440-448 
Figure object, 255 
file management 

binary data formats, 183-187 
commonly used file methods, 82 
design tips, 500 

file input and output with arrays, 115 
JSON data, 178-180 
memory-mapped files, 478 
opening files, 80 
Python file modes, 82 
reading and writing data in text format, 
167-176 

saving plots to files, 267 
Web scraping, 180-183 
working with delimited formats, 176-178 
filling in data 

arithmetic methods with fill values, 148 
filling in missing data, 195-197, 200 
with group-specific values, 306 
fillna method, 192, 195-197, 200, 306, 352 
fill_value method, 315 
filtering 

in pandas library, 140-145 
missing data, 193 
outliers, 205 
find method, 212-213 
findall method, 214, 216, 218 
finditer method, 216 
first method, 156, 296 


fit method, 395, 400 
fixed frequency, 317 
flags attribute, 481 
flatten method, 453 
float data type, 39, 43 
float function, 43 
floatl28 data type, 91 
float 16 data type, 91 
float32 data type, 91 
float64 data type, 91 
floor function, 107 
floordiv method, 149 
floor_divide function, 107 
flow control in Python, 46-50 
flush method, 83, 479 
fmax function, 107 
fmin function, 107 
for loops, 47, 68 
format method, 41 
formatting 

dates and times, 319, 321 
strings, 41 

Fortran order (column major order), 454, 481 
frequencies 
base, 330 

basic for time series, 329 
converting between, 327, 348-354 
date offsets and, 330 
fixed, 317 

period conversion, 340 
quarterly period frequencies, 342 
fromfile function, 471 
frompyfunc function, 468 
from_codes method, 367 
full function, 90 
full_like function, 90 
functions, 69 

(see also universal functions) 
about, 69 

accessing variables, 70 
anonymous, 73 
as objects, 72-73 
currying, 74 

errors and exception handling, 77 
exponentially-weighted, 358 
generators and, 75-80 
grouping with, 295 
in Python, 32 
lambda, 73 
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magic, 28-29 
namespaces and, 70 
object introspection, 23 
partial argument application, 74 
profiling line by line, 496-498 
returning multiple values, 71 
sequence, 59-61 
transforming data using, 198 
type inference in, 168 

writing fast NumPy functions with Numba, 
476-478 

functools module, 74 

G 

gamma function, 119 
generators 
about, 75 

generator expressions for, 76 
itertools module and, 76 
get method, 63, 218 
GET request (HTTP), 187 
getattr function, 35 
getroot method, 182 
get_chunk method, 175 
get_dummies function, 208, 372, 385 
get_indexer method, 164 
get_value method, 145 
GIL (global interpreter lock), 3 
global keyword, 71 
glue for code, Python as, 2 
greater function, 108 
greater_equal function, 108 
Greenwich Mean Time, 335 
group keys, suppressing, 304 
group operations 
about, 287, 373 
cross-tabulation, 315 
data aggregation, 296-302 
GroupBy mechanics, 288-296 
pivot tables, 287, 313-316 
split-apply-combine, 288, 302-312 
unwrapped, 376 
group weighted average, 310 
groupby function, 77 
groupby method, 368, 476 
GroupBy object 
about, 288-291 
grouping by index level, 295 
grouping with diets, 294 


grouping with functions, 295 
grouping with Series, 294 
iterating over groups, 291 
optimized methods, 296 
selecting columns, 293 
selecting subset of columns, 293 
groups method, 215 

H 

%H datetime format, 46, 319 
h(elp) debugger command, 490 
hasattr function, 35 
hash function, 64 
hash maps (see diets) 
hash mark (#), 31 
hashability, 64 

HDF5 (hierarchical data format 5), 184-186, 
480 

HDFStore class, 184 
head method, 129 
heapsort method, 474 
hierarchical data format (HDF5), 480 
hierarchical indexing 
about, 221-224 
in pandas, 170 

reordering and sorting levels, 224 
reshaping data with, 243 
summary statistics by level, 225 
with DataFrame columns, 225 
%hist magic function, 29 
hist method, 277 
histograms, 277-279 
hsplit function, 456 
hstack function, 455 
HTML files, 180-183 
HTTP requests, 187 
Hugunin, Jim, 86 
Hunter, John D., 5, 253 

I 

%I datetime format, 46, 319 
identity function, 90 

IDEs (Integrated Development Environments), 
11 

idxmax method, 160 
idxmin method, 160 
if statement, 46 
iloc operator, 143, 207 
immutable objects, 38, 367 
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import conventions 
for matplotlib, 253 
for modules, 14, 36 
for Python, 14, 36, 88 
importlib module, 499 
imshow function, 109 
in keyword, 56, 212 
in-place sorts, 57, 471 
inld method, 114, 115 
indentation in Python, 30 
index method, 212-213, 315 
Index objects, 134-136 
indexes and indexing 

axis indexes with duplicate labels, 157 
boolean indexing, 99-102 
fancy indexing, 102, 459 
for ndarrays, 94-98 
for pandas library, 140-145, 157 
grouping by index level, 295 
hierarchical indexing, 170, 221-226, 243 
Index objects, 134-136 
integer indexing, 145 
merging on index, 232-235 
renaming axis indexes, 201 
time series data, 323 
time series with duplicate indexes, 326 
timedeltas and, 318 
indexing operator, 58 
indicator variables, 208-211 
indirect sorts, 472 
inner join type, 229 
input variables, 484 
insert method, 55, 136 
insort function, 57 
int data type, 39, 43 
int function, 43 
int 16 data type, 91 
int32 data type, 91 
int64 data type, 91 
int8 data type, 91 
integer arrays, indexing, 102, 459 
integer indexing, 145 

Integrated Development Environments (IDEs), 
11 

interactive debugger, 488-492 
interpreted languages, 2, 16 
interrupting running code, 26 
intersect Id method, 115 
intersection method, 65-66, 136 


intersection_update method, 66 
intervals of time, 317 
inv function, 118 
.ipynb file extension, 20 
IPython 

%run command and, 17 
%run command in, 25-26 
about, 6 

advanced features, 500-502 
bookmarking directories, 487 
code development tips, 498-500 
command history in, 483-485 
exception handling in, 79 
executing code from clipboard, 26 
figures and subplots, 255 
interacting with operating system, 485-487 
keyboard shortcuts for, 27 
magic commands in, 28-29 
matplotlib integration, 29 
object introspection, 23-24 
running Jupyter notebook, 18-20 
running shell, 17-18 
shell commands in, 486 
software development tools, 487-498 
tab completion in, 21-23 
ipython command, 17-18 
is keyword, 38 
is not keyword, 38 
isalnum method, 218 
isalpha method, 218 
isdecimal method, 218 
isdigit method, 218 
isdisjoint method, 66 
isfinite function, 107 
isin method, 136, 163 
isinf function, 107 
isinstance function, 34 
islower method, 218 
isnan function, 107 
isnull method, 126, 192 
isnumeric method, 218 
issubdtype function, 450 
issubset method, 66 
issuperset method, 66 
isupper method, 218 
is_monotonic property, 136 
is_unique property, 136, 157, 326 
iter function, 35 
_iter_magic method, 35 
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iterator protocol, 35, 75-77 
itertools module, 76 

J 

jit function, 477 
join method, 212-213, 218, 235 
join operations, 227-232 
JSON (JavaScript Object Notation), 178-180, 
403 

json method, 187 
Jupyter notebook 

%load magic function, 25 
about, 6 

plotting nuances, 256 
running, 18-20 

jupyter notebook command, 19 

K 

KDE (kernel density estimate) plots, 278 

kernels, defined, 6, 18 

key-value pairs, 61 

keyboard shortcuts for IPython, 27 

Keyboardlnterrupt exception, 26 

KeyError exception, 66 

keys method, 62 

keyword arguments, 32, 70 

kurt method, 160 

L 

l(ist) debugger command, 490 
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axis indexes with duplicate labels, 157 
selecting in matplotlib, 261-263 
lagging data, 332 

lambda (anonymous) functions, 73 
language semantics for Python 
about, 30 
attributes, 35 

binary operators and comparisons, 36, 65 
comments, 31 
duck typing, 35 

function and object method calls, 32 
import conventions, 36 
indentation not braces, 30 
methods, 35 

mutable and immutable objects, 38 
object model, 31 
references, 32-34 


strongly typed language, 33 
variables and argument passing, 32 
last method, 296 
leading data, 332 
left join type, 229 
legend method, 264 
legend selection in matplotlib, 261-265 
len function, 295 
len method, 218 
less function, 108 
less_equal function, 108 
level keyword, 296 
level method, 159 
levels 

grouping by index levels, 295 
sorting, 224 

summary statistics by, 225 
lexsort method, 473 
libraries (see specific libraries) 
line plots, 269-271 

line style selection in matplotlib, 260 
linear algebra, 116-118 
linear regression, 312, 393-396 
Linux, setting up Python on, 9 
list comprehensions, 67-69 
list function, 37, 54 
lists (data structures) 
about, 54 

adding and removing elements, 55 
combining, 56 
concatenating, 56 
maintaining sorted lists, 57 
slicing, 58 
sorting, 57 

lists (data structures)binary searches, 57 

ljust method, 213 

load function, 115, 478 

%load magic function, 25 

loads function, 179 

loc operator, 130, 143, 265, 385 

local namespace, 70, 123 

localizing data to time zones, 335 

log function, 107 

loglO function, 107 

loglp function, 107 

log2 function, 107 

logical_and function, 108, 466 

logical_not function, 107 

logical_or function, 108 
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logical_xor function, 108 
LogisticRegression class, 399 
LogisticRegressionCV class, 400 
long format, 246 
lower method, 199, 213, 218 
%lprun magic function, 496 
lstrip method, 213, 219 
lstsq function, 118 
lxml library, 180-183 
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%m datetime format, 46, 319 
%M datetime format, 46, 319 
mad method, 160 
magic functions, 28-29 

(see also specific magic functions) 
%debug magic function, 29 
%magic magic function, 29 
many-to-many merge, 229 
many-to-one join, 228 
map built-in function, 68, 73 
map method, 153, 199, 202 
mapping 

transforming data using, 198 
universal functions, 151-156 
margins method, 315 
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marker selection in matplotlib, 260 
match method, 164, 214, 216, 219 
Math Kernel Library (MKL), 117 
matplotlib library 
about, 5, 253 
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configuring, 268 
creating image plots, 109 
figures in, 255-259 
import convention, 253 
integration with IPython, 29 
label selection in, 261-263 
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line style selection in, 260 
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saving plots to files, 267 
subplots in, 255-259, 265-267 
tick mark selection in, 261-263 
%matplotlib magic function, 30, 486 
matrix operations in NumPy, 104, 116 
max method, 112, 156, 160, 296 


maximum function, 107 
mean method, 112, 160, 289, 296 
median method, 160, 296 
melt method, 249 
memmap object, 478 
memory management 

C versus Fortran order, 454 
continguous memory, 480-482 
NumPy-based algorithms and, 87 
memory-mapped files, 478 
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read_csv function, 80, 167, 172, 274, 298 
read_excel function, 167, 186 
read_feather function, 168 
read_fwf function, 167 
read_hdf function, 167, 185 
read_html function, 167, 180-183 
read_json function, 167, 179 
read_msgpack function, 167 
read_pickle function, 167, 183 
read_sas function, 168 
read_sql function, 168, 190 
read_stata function, 168 
read_table function, 167, 172, 176 
reduce method, 466 
reduceat method, 467 
reductions (aggregations), 111 
references in Python, 32-34 
regplot method, 281 
regress function, 312 
regular expressions 

passes as delimiters, 171 
string manipulation and, 213-216 
reindex method, 136-138, 145, 157, 352 
reload function, 499 
remove method, 56, 66 
remove_categories method, 372 
remove_unused_categories method, 372 
rename method, 202 
rename_categories method, 372 
reorder_categories method, 372 
repeat function, 457 
repeat method, 219 
replace method, 200, 212-213, 219 
requests package, 187 
resample method, 327, 348-351, 377 
resampling 
defined, 348 

downsampling and, 348-351 
OHLC, 351 

upsampling and, 348, 352 
with periods, 353 
%reset magic function, 29, 485 
reset_index method, 250, 302 


reshape method, 103, 452 
*rest syntax, 54 
return statement, 69 
reusing command history, 483 
reversed function, 61 
rfind method, 213 
rfloordiv method, 149 
right join type, 229 
rint function, 107 
rjust method, 213 
rmul method, 149 
rollback method, 333 
rollforward method, 333 
rolling function, 355, 357 
rolling_corr function, 360 
row major order (C order), 454, 481 
row_stack function, 456 
rpow method, 149 
rstrip method, 213, 219 
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